// SPDX-License-Identifier: Apache-2.0
// Copyright (c) 2026-present, Structured World Foundation
//! Per-record length-prefixed framing for manifest sections.
//!
//! ## Why framing
//!
//! The pre-framing manifest format wrote each `tables` / `blob_files`
//! record back-to-back with no per-record header. A single corrupt
//! byte anywhere inside the section invalidated every record that
//! followed: the reader had no way to locate the start of the next
//! valid record, so recovery had to choose between (a) aborting the
//! open ([`ManifestRecoveryMode::AbsoluteConsistency`]) or
//! (b) accepting only a clean tail-truncation ([`ManifestRecoveryMode::TolerateCorruptedTailRecords`]).
//!
//! [`ManifestRecoveryMode::PointInTimeRecovery`] and
//! [`ManifestRecoveryMode::SkipAnyCorruptedRecords`] both need to do
//! more than that: PIT wants to stop at the last consistent
//! record-group boundary and accept the prefix; `SkipAny` wants to
//! skip one bad record and keep reading. Both modes need to know
//! the exact byte length of each record so they can step past one
//! without losing sync with the rest.
//!
//! ## Wire format
//!
//! Each framed record is:
//!
//! ```text
//! +----------------+------------------+---------------------+
//! | len: u32 LE    | xxh3_64: u64 LE  | payload: [u8; len]  |
//! +----------------+------------------+---------------------+
//! ```
//!
//! - `len` is the size of `payload` (does NOT include the 12-byte
//! header itself). A `len` value larger than the section's
//! remaining capacity is treated as `TailTruncation`, not as
//! in-section corruption — under tolerant modes this lets a
//! power-loss-mid-record recovery accept the prefix instead of
//! aborting. The reader still does NOT trust the `len` for
//! skipping (the byte boundary of the next record cannot be
//! located from a partial trailing record), so the consumer
//! abandons the rest of the section regardless. Use the
//! `expected_payload_len` parameter on
//! [`read_framed_record`] when the record schema has a fixed
//! payload size (table / blob entries) to pin the `len`
//! structurally and rule out a "len happens to fit but is
//! wrong" alignment slide under `SkipAnyCorruptedRecords`.
//! - `xxh3_64` is `xxh3_64(payload)`. The 64-bit variant gives a
//! ≈ 2⁻⁶⁴ false-positive collision rate per record, matching the
//! integrity bar of the rest of the on-disk format.
//! - `payload` is the same bytes the pre-framing writer emitted
//! for that record. Migration cost is zero on the payload schema;
//! only the surrounding 12 bytes are new.
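//!
/*!
The layout above can be sketched in a few lines. This is a minimal
illustration, not this module's writer: `encode_frame` and `decode_header`
are hypothetical names, and the digest is passed in as an opaque `u64`
because computing a real `xxh3_64` needs an external crate.

```rust
// Header size from the wire format: 4-byte `len` + 8-byte digest.
const FRAME_HEADER_LEN: usize = 4 + 8;

// Hypothetical helper: lays out `len | digest | payload`, little-endian.
// Assumes the payload length fits in a u32 (the module caps it far lower).
fn encode_frame(payload: &[u8], digest: u64) -> Vec<u8> {
    let mut frame = Vec::with_capacity(FRAME_HEADER_LEN + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(&digest.to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

// Hypothetical helper: splits the 12-byte header back into `(len, digest)`.
// Returns `None` when fewer than 12 bytes are available.
fn decode_header(frame: &[u8]) -> Option<(u32, u64)> {
    if frame.len() < FRAME_HEADER_LEN {
        return None;
    }
    let mut len = [0u8; 4];
    len.copy_from_slice(&frame[..4]);
    let mut digest = [0u8; 8];
    digest.copy_from_slice(&frame[4..FRAME_HEADER_LEN]);
    Some((u32::from_le_bytes(len), u64::from_le_bytes(digest)))
}
```

For the payload `b"abc"` and the digest `0x0102030405060708`, `encode_frame`
returns 15 bytes beginning `03 00 00 00 08 07 06 05 04 03 02 01`, and
`decode_header` recovers `(3, 0x0102030405060708)`.
*/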
//!
//! ## Trade-off
//!
//! 12 bytes of header per record. For a `tables` section's 33-byte
//! table record this is ~36% overhead; for a `blob_files` section's
//! 25-byte record it is ~48%. The manifest is small (KiB-scale even
//! for trees with tens of thousands of tables), so the absolute
//! cost is negligible. The recovery flexibility — per-record skip,
//! exact record-group boundaries — is worth the overhead.
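//!
/*!
The percentages above follow directly from the 12-byte header. A small
sketch of the arithmetic, using hypothetical helper names that are not part
of this module:

```rust
// Header size from the wire format: 4-byte `len` + 8-byte digest.
const FRAME_HEADER_LEN: usize = 4 + 8;

// Header bytes as a percentage of the payload size.
fn header_overhead_percent(payload_len: usize) -> f64 {
    100.0 * FRAME_HEADER_LEN as f64 / payload_len as f64
}

// Total on-disk bytes for `records` framed records of equal payload size.
fn framed_section_bytes(payload_len: usize, records: usize) -> usize {
    (FRAME_HEADER_LEN + payload_len) * records
}
```

With these definitions, 10,000 table records of 33 bytes occupy 450,000
bytes once framed (about 440 KiB), compared with 330,000 bytes unframed.
*/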
//!
//! [`ManifestRecoveryMode::AbsoluteConsistency`]: crate::config::ManifestRecoveryMode::AbsoluteConsistency
//! [`ManifestRecoveryMode::TolerateCorruptedTailRecords`]: crate::config::ManifestRecoveryMode::TolerateCorruptedTailRecords
//! [`ManifestRecoveryMode::PointInTimeRecovery`]: crate::config::ManifestRecoveryMode::PointInTimeRecovery
//! [`ManifestRecoveryMode::SkipAnyCorruptedRecords`]: crate::config::ManifestRecoveryMode::SkipAnyCorruptedRecords
use crate;
use Vec;
use crate;
/// Size of the framing header in bytes (4 B `len` + 8 B `xxh3_64`).
pub const FRAME_HEADER_LEN: usize = 4 + 8;
/// Hard cap on `len` — keeps an obviously-forged value from
/// triggering an allocation that exceeds reasonable manifest
/// record size. The largest legitimate record today is the
/// `tables` per-table entry at 33 bytes; even a hypothetical
/// future record with a comparator name string is bounded by the
/// `comparator_name` length cap upstream. 64 KiB is a generous
/// ceiling that still cuts off any `len` that would otherwise
/// trigger a multi-megabyte allocation on a corrupt header.
pub const MAX_FRAME_PAYLOAD: u32 = 64 * 1024;
/// Writes a framed record. The closure-provided payload is
/// assembled in a temporary `Vec<u8>` first so the `len` and
/// XXH3-64 digest can be computed from the actual emitted bytes;
/// the 12-byte header is then written in a single pass before
/// the payload (no seek/backpatch is involved — both header
/// fields are known by the time the first byte of the header
/// reaches `writer`).
///
/// # Errors
///
/// Returns the I/O error from `writer` if any write fails,
/// surfaces any error returned by `payload_fn`, or returns
/// [`crate::Error::Unrecoverable`] when the payload exceeds
/// [`MAX_FRAME_PAYLOAD`] — emitting an oversized record would
/// produce a frame the reader always rejects as
/// [`FramedRecordOutcome::BadHeader`], silently bricking
/// recovery for that section.
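///
/**
A minimal sketch of the buffered, single-pass write described above. It is
not the function this comment documents. The name `write_frame` and its
signature are assumptions, the oversize case returns an `io::Error` of kind
`InvalidInput` instead of `crate::Error::Unrecoverable`, and FNV-1a stands in
for XXH3-64 so that the example needs no external crate.

```rust
use std::io::{self, Write};

const MAX_FRAME_PAYLOAD: u32 = 64 * 1024;

// FNV-1a 64-bit, a stand-in for xxh3_64 (NOT the real on-disk checksum).
fn checksum64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0100_0000_01b3);
    }
    hash
}

// Hypothetical writer: buffers the payload, then emits header and payload.
fn write_frame<W, F>(writer: &mut W, payload_fn: F) -> io::Result<()>
where
    W: Write,
    F: FnOnce(&mut Vec<u8>) -> io::Result<()>,
{
    // Assemble the payload first so `len` and the digest are both known
    // before the first header byte is written.
    let mut payload = Vec::new();
    payload_fn(&mut payload)?;

    // Refuse to emit a frame that a capped reader would always reject.
    if payload.len() > MAX_FRAME_PAYLOAD as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "payload exceeds MAX_FRAME_PAYLOAD",
        ));
    }

    let mut header = [0u8; 12];
    header[..4].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    header[4..].copy_from_slice(&checksum64(&payload).to_le_bytes());
    writer.write_all(&header)?;
    writer.write_all(&payload)
}
```

Because the size check runs before any call to `write_all`, an oversized
payload leaves the writer untouched in this sketch.
*/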
/// Result of [`read_framed_record`] — exposes the per-record
/// outcome so the caller can apply mode-specific recovery policy.
/// Reads one framed record. Never panics or aborts on bad bytes —
/// the outcome variant tells the caller what happened and how
/// many bytes (if any) were consumed.
///
/// `remaining_in_section` lets the reader reject a `len` value
/// that exceeds the section payload bound, catching the case
/// where the header itself is corrupt and the length field points
/// well past the legitimate end. Pass `u64::MAX` if the section
/// boundary is not known.
///
/// `expected_payload_len`, when `Some(n)`, pins the record to a
/// fixed payload size: any `len != n` is treated as
/// [`FramedRecordOutcome::LenMismatch`] BEFORE the payload is
/// consumed, so a corrupted-but-plausible `len` (still within
/// `MAX_FRAME_PAYLOAD` and the section bound) cannot mis-align
/// the cursor for the next record. This is the critical safety
/// net for [`crate::config::ManifestRecoveryMode::SkipAnyCorruptedRecords`].
/// Without the fixed-length pin, a corrupt `len` would make the
/// reader consume the wrong number of payload bytes and fail the
/// XXH3 check. The `SkipAny` arm would then "continue past the
/// record" with the cursor off by `(corrupt_len - real_len)`
/// bytes, so the next read would decode garbage as a new record.
/// With the pin, the reader stops at `LenMismatch` having consumed
/// only the 4-byte `len` and no payload bytes. The recovery
/// callers (see `src/version/recovery.rs`) hard-abort on
/// `LenMismatch` in EVERY mode rather than dropping the rest of
/// the section.
///
/// This is the deliberate distinction from
/// [`FramedRecordOutcome::BadHeader`] (a truly implausible
/// `len > MAX_FRAME_PAYLOAD`, treated as in-section corruption
/// that the tolerant modes can absorb). A size disagreement with
/// the caller's fixed-length pin can be either writer / reader
/// format drift OR a corrupted length field that still fits
/// `MAX_FRAME_PAYLOAD`, and the reader cannot tell the two apart.
/// Masking it with a section drop would, in the first case, let
/// an incompatible on-disk schema slip through tolerant recovery
/// undetected and, in the second, compound the in-record damage.
///
/// Pass `None` for variable-size records (none currently exist
/// in the manifest, but the parameter is kept open-ended for
/// future record types).
///
/// # Errors
///
/// Returns the underlying [`std::io::Error`] when a read fails for
/// reasons other than EOF (the EOF case maps to
/// [`FramedRecordOutcome::TailTruncation`]). Decode-time errors
/// (checksum mismatch, oversized header) are surfaced via the
/// returned [`FramedRecordOutcome`] variant rather than `Err`.
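///
/**
The read path described above can be sketched as follows. This is an
illustration under stated assumptions, not the documented function.
`SketchOutcome` is a hypothetical type: `TailTruncation`, `BadHeader`, and
`LenMismatch` mirror names used in these docs, while `Record` and
`ChecksumMismatch` are placeholders. FNV-1a stands in for XXH3-64, the order
of the checks is assumed, and `remaining_in_section` is assumed to count
bytes from the start of the frame.

```rust
use std::io::{self, ErrorKind, Read};

const MAX_FRAME_PAYLOAD: u32 = 64 * 1024;

#[derive(Debug, PartialEq)]
enum SketchOutcome {
    Record(Vec<u8>),
    TailTruncation,
    BadHeader,
    LenMismatch,
    ChecksumMismatch,
}

// FNV-1a 64-bit, a stand-in for xxh3_64 (NOT the real on-disk checksum).
fn checksum64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0100_0000_01b3);
    }
    hash
}

// `read_exact` that reports EOF as `Ok(false)` instead of an error.
fn fill<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<bool> {
    match reader.read_exact(buf) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

fn read_frame<R: Read>(
    reader: &mut R,
    remaining_in_section: u64,
    expected_payload_len: Option<u32>,
) -> io::Result<SketchOutcome> {
    let mut len_buf = [0u8; 4];
    if !fill(reader, &mut len_buf)? {
        return Ok(SketchOutcome::TailTruncation);
    }
    let len = u32::from_le_bytes(len_buf);
    // Implausible length: reject before allocating anything.
    if len > MAX_FRAME_PAYLOAD {
        return Ok(SketchOutcome::BadHeader);
    }
    // Fixed-size pin: stop with only the 4-byte `len` consumed.
    if expected_payload_len.map_or(false, |n| n != len) {
        return Ok(SketchOutcome::LenMismatch);
    }
    // A frame that cannot fit in the section is a truncated tail.
    if 12 + u64::from(len) > remaining_in_section {
        return Ok(SketchOutcome::TailTruncation);
    }
    let mut sum_buf = [0u8; 8];
    let mut payload = vec![0u8; len as usize];
    if !fill(reader, &mut sum_buf)? || !fill(reader, &mut payload)? {
        return Ok(SketchOutcome::TailTruncation);
    }
    if checksum64(&payload) != u64::from_le_bytes(sum_buf) {
        return Ok(SketchOutcome::ChecksumMismatch);
    }
    Ok(SketchOutcome::Record(payload))
}
```

Reading a frame whose `len` is 3 with `expected_payload_len = Some(4)`
returns `LenMismatch` in this sketch after exactly 4 bytes have been
consumed, so the cursor never moves by a length taken from an untrusted
header.
*/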