//! # Flush Buffer — Latch-Free I/O Buffer Ring
//!
//! This module implements LLAMA's in-memory write-staging layer: a fixed-size
//! ring of 4 KB-aligned [`FlushBuffer`]s that amortises individual page-state
//! writes into larger, sequential I/O operations before they are dispatched to
//! the [`LogStructuredStore`](crate::log_structured_store::LogStructuredStore).
//!
//! ## Design Goals
//!
//! | Goal | Mechanism |
//! |-------------------------|------------------------------------------------------------|
//! | Latch-free writes | Single packed [`AtomicUsize`] state word per buffer |
//! | `O_DIRECT` compatibility| 4 KB-aligned allocation via [`Buffer::new_aligned`] |
//! | Amortised I/O | Multiple threads fill one buffer before it is flushed |
//! | All threads participate | Any thread may seal or initiate a flush |
//!
//! ## Flush Protocol
//!
//! Adapted from the LLAMA paper; all steps are performed without global locks:
//!
//! 1. **Identify** the page state to be written.
//! 2. **Seize** space in the active [`FlushBuffer`] via
//! [`reserve_space`](FlushBuffer::reserve_space) — an atomic fetch-and-add
//! on the packed state word claims a non-overlapping byte range.
//! 3. **Check** atomically whether the reservation succeeded. If the buffer was
//!    already sealed or the claim ran past the end of the buffer, the buffer is
//!    sealed (if it was not already) and the ring rotates to the next available
//!    slot (see the sketch below).
//! 4. **Write** the payload into the reserved range while the flush-in-progress
//! bit prevents the buffer from being dispatched to stable storage prematurely.
//! 5. **On failure** at step 3, write a "Failed Flush" sentinel into the reserved
//! space. This wastes a few bytes but removes all ambiguity about which writes
//! succeeded.
//!
//! Though the current implementation delegates the handling of all erroneous and
//! invalid states to the caller, the flush procedure should lend itself well to
//! the full LLAMA flushing protocol.
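//!
//! As a hedged sketch, steps 2 and 3 reduce to a single fetch-and-add on the
//! packed state word (the function signature and error handling here are
//! illustrative, not the module's actual API):
//!
//! ```ignore
//! fn reserve(state: &AtomicUsize, len: usize, cap: usize) -> Result<usize, usize> {
//!     // Step 2: one fetch_add claims `len` bytes and registers one writer.
//!     let prev = state.fetch_add((len << OFFSET_SHIFT) + WRITER_ONE, Ordering::AcqRel);
//!     let start = prev >> OFFSET_SHIFT;
//!     // Step 3: the reservation fails if the buffer was sealed or the claim
//!     // ran past the end; the range still belongs to this thread, so step 5
//!     // can fill it with the "Failed Flush" sentinel.
//!     if prev & SEALED_BIT != 0 || start + len > cap {
//!         Err(start)
//!     } else {
//!         Ok(start) // exclusive byte range [start, start + len)
//!     }
//! }
//! ```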
//!
//! ## State Word Layout
//!
//! All per-buffer metadata is packed into a single [`AtomicUsize`], making every
//! state snapshot self-consistent and eliminating TOCTOU (time of check/time of use) races between the
//! fields:
//!
//! ```text
//! ┌────────────────┬────────────────┬──────────────────┬───────────────────┬──────────┐
//! │ Bits 63..32 │ Bits 31..8 │ Bits 7..2 │ Bit 1 │ Bit 0 │
//! │ write offset │ writer count │ (reserved) │ flush-in-prog │ sealed │
//! └────────────────┴────────────────┴──────────────────┴───────────────────┴──────────┘
//! ```
//!
//! * **write offset** — next free byte position inside the backing allocation.
//! * **writer count** — number of threads that have reserved space but not yet finished
//! copying their payload.
//! * **flush-in-progress** — set by whichever thread wins the CAS race to own the
//! flush; prevents a second flush from being fired while the first is in flight.
//! * **sealed** — set when the buffer is full or explicitly closed; prevents new
//! reservations.
//!
//! Bits 7..2 are reserved and currently unused.
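//!
//! Because the whole word is read in one atomic load, decoding a snapshot is
//! race-free. A sketch using the constants defined below:
//!
//! ```ignore
//! let s = state.load(Ordering::Acquire);
//! let offset   = s >> OFFSET_SHIFT;                 // bits 63..32
//! let writers  = (s & WRITER_MASK) >> WRITER_SHIFT; // bits 31..8
//! let flushing = s & FLUSH_IN_PROGRESS_BIT != 0;    // bit 1
//! let sealed   = s & SEALED_BIT != 0;               // bit 0
//! ```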
use std::cell::UnsafeCell;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
/// A 4 KB-aligned, heap-allocated byte buffer suitable for `O_DIRECT` I/O.
///
/// `Buffer` owns a single contiguous allocation aligned to
/// [`FOUR_KB_BLOCK`] (4096 bytes), which satisfies the `O_DIRECT` alignment
/// requirements of all common block devices.
///
/// Cursor management is **not** handled here. Instead, [`FlushBuffer`] uses
/// atomic fetch-and-add on its packed state word to hand out non-overlapping
/// byte ranges to concurrent writers. This is what makes the
/// `unsafe impl Sync` sound: no two threads are ever granted the same region.
///
/// # Safety
///
/// [`Sync`] is manually implemented because [`UnsafeCell`] opts out of it by
/// default. The invariant that upholds this is: all mutable access to the
/// inner pointer is mediated by [`FlushBuffer`], which guarantees exclusive
/// ranges per writer.
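///
/// A minimal sketch of the kind of aligned allocation `new_aligned` is
/// documented to perform (the constructor body itself is an assumption):
///
/// ```ignore
/// use std::alloc::{alloc, Layout};
/// let layout = Layout::from_size_align(ONE_MEGABYTE_BLOCK, FOUR_KB_BLOCK).unwrap();
/// // SAFETY: the layout has a non-zero size; the pointer is checked below.
/// let ptr = unsafe { alloc(layout) };
/// assert!(!ptr.is_null());
/// assert_eq!(ptr as usize % FOUR_KB_BLOCK, 0); // aligned for O_DIRECT
/// ```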
pub struct Buffer {
    /// Raw pointer to the 4 KB-aligned allocation. The field layout here is a
    /// reconstruction: the docs above only require an `UnsafeCell`-wrapped
    /// allocation handed out by [`Buffer::new_aligned`].
    data: UnsafeCell<*mut u8>,
    /// Size of the allocation in bytes.
    capacity: usize,
}

unsafe impl Send for Buffer {}
unsafe impl Sync for Buffer {}
/// A reference-counted handle to a [`Buffer`].
///
/// Shared between a [`FlushBuffer`] and the `io_uring` submission path, which
/// holds a pointer into the buffer while a write is in flight.
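///
/// For instance (the `new_aligned` signature is an assumption):
///
/// ```ignore
/// let buf: SharedBuffer = Arc::new(Buffer::new_aligned(ONE_MEGABYTE_BLOCK));
/// ```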
pub type SharedBuffer = Arc<Buffer>;
/// Bit 0 of the state word — set when the buffer is closed to new writers.
const SEALED_BIT: usize = 1 << 0;
/// Bit 1 of the state word — set while a flush is in progress.
///
/// Prevents a second flush from being fired concurrently and prevents new
/// writers from entering a buffer that is already being drained.
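///
/// A hedged sketch of how a thread might win the CAS race to own the flush
/// (the surrounding retry loop is omitted):
///
/// ```ignore
/// let prev = state.load(Ordering::Acquire);
/// let won = prev & FLUSH_IN_PROGRESS_BIT == 0
///     && state
///         .compare_exchange(prev, prev | FLUSH_IN_PROGRESS_BIT,
///                           Ordering::AcqRel, Ordering::Acquire)
///         .is_ok();
/// ```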
pub const FLUSH_IN_PROGRESS_BIT: usize = 1 << 1;
/// Bit position at which the writer-count field begins.
const WRITER_SHIFT: usize = 8;
/// Amount added to the state word to record one additional active writer.
const WRITER_ONE: usize = 1 << WRITER_SHIFT;
/// Mask covering the writer-count field (bits 8..32).
const WRITER_MASK: usize = 0xFFFF_FF00;
/// The write-offset field occupies the top 32 bits of the state word
/// (assuming a 64-bit `usize`).
const OFFSET_SHIFT: usize = 32;
/// Amount added to the state word to advance the write offset by one byte.
const OFFSET_ONE: usize = 1 << OFFSET_SHIFT;
/// Default number of buffers in a [`FlushBufferRing`].
pub const RING_SIZE: usize = 4;
/// The size of a 1 MB block.
pub const ONE_MEGABYTE_BLOCK: usize = 1024 * 1024;
/// The size of a 4 KB block, the alignment unit required for `O_DIRECT` I/O.
pub const FOUR_KB_BLOCK: usize = 4096;
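// Illustrative compile-time sanity checks: the packed fields documented
// above must not overlap one another.
const _: () = assert!(WRITER_MASK & (SEALED_BIT | FLUSH_IN_PROGRESS_BIT) == 0);
const _: () = assert!(WRITER_MASK >> WRITER_SHIFT == 0xFF_FFFF);
const _: () = assert!(WRITER_MASK >> OFFSET_SHIFT == 0);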
// NOTE: the names and bodies below are illustrative reconstructions based on
// the state-word layout documented above.
/// Extracts the current write offset from the state word.
fn offset(state: usize) -> usize { state >> OFFSET_SHIFT }
/// Extracts the current number of active writers from the state word.
fn writer_count(state: usize) -> usize { (state & WRITER_MASK) >> WRITER_SHIFT }
/// Returns whether the sealed bit of the state word is set.
fn is_sealed(state: usize) -> bool { state & SEALED_BIT != 0 }
/// Returns whether the flush-in-progress bit of the state word is set.
fn is_flush_in_progress(state: usize) -> bool { state & FLUSH_IN_PROGRESS_BIT != 0 }
/// Errors that may be returned by buffer and ring operations.
/// (The variant set here is an illustrative reconstruction.)
#[derive(Debug, PartialEq, Eq)]
pub enum BufferError { Sealed, OutOfSpace }
/// Successful outcomes returned by buffer and ring operations.
/// (The variant set here is an illustrative reconstruction.)
#[derive(Debug, PartialEq, Eq)]
pub enum BufferSuccess { Reserved, Sealed }
/// A single 4 KB-aligned latch-free I/O buffer.
///
/// Multiple threads write into a `FlushBuffer` concurrently by atomically
/// claiming non-overlapping byte ranges through [`FlushBuffer::reserve_space`].
/// Once the buffer is full (or explicitly sealed), it is dispatched to a
/// [`QuikIO`] instance for an asynchronous flush and is subsequently reset for
/// reuse.
///
/// # State Word
///
/// ```text
/// ┌────────────────┬────────────────┬──────────────────┬───────────────────┬──────────┐
/// │ Bits 63..32 │ Bits 31..8 │ Bits 7..2 │ Bit 1 │ Bit 0 │
/// │ write offset │ writer count │ (reserved) │ flush-in-prog │ sealed │
/// └────────────────┴────────────────┴──────────────────┴───────────────────┴──────────┘
/// ```
///
/// All four fields are read and updated through a single [`AtomicUsize`], so
/// any snapshot is self-consistent: there are no TOCTOU races between the
/// offset, writer count, flush flag, and sealed flag.
///
/// # Safety
///
/// `FlushBuffer` is `Send + Sync`. The only `unsafe` access is inside
/// [`write`](Self::write), where a raw pointer into the aligned allocation is
/// dereferenced. Safety is upheld by the invariant that
/// [`reserve_space`](Self::reserve_space) grants each caller an exclusive,
/// non-overlapping byte range.
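///
/// # Example
///
/// A hedged sketch of the write path (method signatures are assumptions;
/// `reserve_space` and `write` are the names used in these docs):
///
/// ```ignore
/// let offset = flush_buffer.reserve_space(payload.len())?; // claim a range
/// flush_buffer.write(offset, &payload);                    // exclusive copy
/// ```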
pub struct FlushBuffer {
    // Field layout is a reconstruction based on the documentation above.
    /// Packed state word: write offset | writer count | flush-in-prog | sealed.
    state: AtomicUsize,
    /// The 4 KB-aligned backing allocation, shared with the I/O path.
    buffer: SharedBuffer,
}

unsafe impl Send for FlushBuffer {}
unsafe impl Sync for FlushBuffer {}