//! hash-roll provides various content defined chunking algorithms
//!
//! Content defined chunking (CDC) algorithms examine a stream of input bytes (often
//! represented as a slice like `&[u8]`) and provide locations within that stream at which to
//! split (or "chunk") it into parts.
//!
//! CDC algorithms generally try to optimize for the following:
//!
//! 1. Processing speed (i.e., bytes per second)
//! 2. Stability of split locations even when insertions or deletions of bytes occur
//! 3. A reasonable distribution of chunk lengths
//!
//! ## API Concepts
//!
//! - Configured Algorithm Instance (implements [`Chunk`]). Normally named plainly (like
//!   [`Bup`]). These can be thought of as the "parameters" for an algorithm.
//! - Incremental (implements [`ChunkIncr`]). Normally named with an `Incr` suffix.
//!
//! Because of the various ways one might use a CDC, and the different CDC algorithm
//! characteristics, hash-roll provides a few ways to use them.
//!
//! Configured Algorithm Instances are created from the set of configuration needed for a given
//! algorithm. For example, this might mean configuring a window size or how to decide where to
//! split. They don't include any mutable data; in other words, they don't keep track of what data
//! is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well as
//! methods to obtain other kinds of APIs, like incremental-style APIs.
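//!
//! As a sketch of this distinction (using a toy splitting rule and hypothetical names, not one
//! of this crate's real algorithms), a configured instance holds only parameters and offers an
//! all-at-once API over a contiguous slice:
//!
//! ```rust
//! /// Toy "configured algorithm instance": parameters only, no mutable state.
//! struct ToySplit {
//!     mask: u32,
//! }
//!
//! impl ToySplit {
//!     /// All-at-once API: return the exclusive end index of each chunk in `data`.
//!     fn split_points(&self, data: &[u8]) -> Vec<usize> {
//!         let mut cuts = Vec::new();
//!         let mut sum: u32 = 0;
//!         for (i, &b) in data.iter().enumerate() {
//!             sum = sum.wrapping_add(b as u32);
//!             if sum & self.mask == self.mask {
//!                 cuts.push(i + 1);
//!                 sum = 0; // per-chunk state resets at each split point
//!             }
//!         }
//!         cuts
//!     }
//! }
//!
//! let splitter = ToySplit { mask: 0x3 };
//! assert_eq!(splitter.split_points(&[1, 2, 3, 4, 5, 6]), vec![2, 3, 6]);
//! ```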
//!
//! ## CDC Algorithms and Window Buffering
//!
//! Different CDC algorithms have different constraints on how they process data. Notably, some
//! require a large amount of previously processed data to process additional data. This "large
//! amount of previously processed data" is typically referred to as the "window". That said, note
//! that some CDC algorithms that use a window concept don't need access to previously processed
//! data.
//!
//! For the window-buffering algorithms, there is an extra cost to certain types of API
//! implementations. The documentation will note when these occur and suggest alternatives.
//!
//! Generally, CDC interfaces that are incremental will be slower for window-buffering algorithms.
//! Using an explicitly allocating interface (which emits `Vec<u8>` or `Vec<Vec<u8>>`) will have no
//! worse performance than the incremental API, but may be more convenient. Using an all-at-once
//! API will provide the best performance because no buffering is required (the input data can be
//! used directly).
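//!
//! To make the window-buffering cost concrete, here is a toy incremental splitter (hypothetical
//! names, not this crate's implementation) whose rule needs the last `window` bytes; every input
//! byte must pass through an internal ring buffer:
//!
//! ```rust
//! use std::collections::VecDeque;
//!
//! /// Toy incremental splitter: cuts where the sum over the last `window`
//! /// bytes is divisible by `modulus`.
//! struct ToyWindowIncr {
//!     window: usize,
//!     modulus: u32,
//!     buf: VecDeque<u8>, // the buffering cost: all input passes through here
//!     sum: u32,
//!     consumed: usize,
//! }
//!
//! impl ToyWindowIncr {
//!     fn new(window: usize, modulus: u32) -> Self {
//!         ToyWindowIncr { window, modulus, buf: VecDeque::new(), sum: 0, consumed: 0 }
//!     }
//!
//!     /// Feed more input; return the absolute index of any new split points.
//!     fn push(&mut self, data: &[u8]) -> Vec<usize> {
//!         let mut cuts = Vec::new();
//!         for &b in data {
//!             self.buf.push_back(b);
//!             self.sum = self.sum.wrapping_add(b as u32);
//!             if self.buf.len() > self.window {
//!                 // slide: the oldest byte leaves the window
//!                 let old = self.buf.pop_front().unwrap();
//!                 self.sum = self.sum.wrapping_sub(old as u32);
//!             }
//!             self.consumed += 1;
//!             if self.buf.len() == self.window && self.sum % self.modulus == 0 {
//!                 cuts.push(self.consumed);
//!             }
//!         }
//!         cuts
//!     }
//! }
//!
//! let mut incr = ToyWindowIncr::new(2, 5);
//! assert_eq!(incr.push(&[2, 3]), vec![2]); // 2 + 3 == 5
//! assert_eq!(incr.push(&[1, 4]), vec![4]); // 1 + 4 == 5
//! ```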
//!
//! ## Use Cases that drive API choices
//!
//! - accumulate vecs, emit vecs
//!   - incremental: yes
//!   - input: `Vec<u8>`
//!   - internal state: `Vec<Vec<u8>>`
//!   - output: `Vec<Vec<u8>>`
//!
//! - stream data through
//!   - incremental: yes
//!   - input: `&[u8]`
//!
//! - mmap (or read entire) file, emit
//!   - incremental: no
//!   - input: `&[u8]`
//!   - output: `&[u8]`
// # API Design Notes
//
// ## What methods should be in a trait? What should be in wrapper structs?
//
// - place methods that might have more optimized variants, but can have common implementations,
//   in a trait. This notably affects window-buffering differences: it's always possible to
//   implement all-at-once processing using incremental interfaces that internally buffer, but
//   it's much more efficient for window-buffering algorithms to provide implementations that know
//   how to look into the input data directly.
/* TODO: Rabin-Karp
* H = c_1 * a ** (k-1) + c_2 * a ** (k-2) ... + c_k * a ** 0
* where:
* a is a constant
* c_1, ..., c_k are the input characters
*
* All math is done modulo n. The choice of n and a is critical.
*
* Parameters:
* - n: modulo limit
* - a: a constant
*
* State:
* H
*
* Application:
*/
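// The rolling property that makes the Rabin-Karp hash above usable for chunking: when the
// window slides by one byte, H can be updated in O(1). A sketch (the constants here are
// arbitrary illustrative choices, not recommended parameters):
//
// ```rust
// const A: u64 = 31; // the constant `a`
// const N: u64 = 1_000_000_007; // the modulus `n`
//
// /// Hash of a full window: H = c_1*a^(k-1) + c_2*a^(k-2) + ... + c_k*a^0 (mod n)
// fn hash(window: &[u8]) -> u64 {
//     window.iter().fold(0, |h, &c| (h * A + c as u64) % N)
// }
//
// /// Slide the window by one byte: drop `old` (the former c_1), append `new`.
// /// `a_pow` must be a^(k-1) % n for a window of length k.
// fn roll(h: u64, old: u8, new: u8, a_pow: u64) -> u64 {
//     // remove old's contribution (adding N first keeps the value non-negative)
//     let without_old = (h + N - (old as u64 * a_pow) % N) % N;
//     (without_old * A + new as u64) % N
// }
//
// // rolling from window [1, 2, 3] to [2, 3, 4] matches a freshly computed hash
// assert_eq!(roll(hash(&[1, 2, 3]), 1, 4, 31 * 31), hash(&[2, 3, 4]));
// ```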
/* TODO:
* rollsum of librsync
*/
use std::mem;
pub use RangeExt;
/// Accept incremental input and provide indexes of split points
///
/// Compared to [`Chunk`], [`ChunkIncr`] avoids having to buffer all input data in memory,
/// and avoids the need to use a single buffer for storing the input data (even if all data is in
/// memory).
///
/// Data fed into a given [`ChunkIncr`] instance is considered to be part of the same
/// data "source". This affects chunking algorithms that maintain some state between chunks
/// (like `ZstdRsyncable` does). If you have multiple "sources", you should obtain a new instance
/// of [`ChunkIncr`] for each of them (typically via [`ToChunkIncr`]).
///
/// Note that for some splitting/chunking algorithms, the incremental API will be less efficient
/// than the non-incremental API. In particular, algorithms like [`Rsyncable`] that require
/// the use of previously examined data to shift their "window" (resulting in the need for a
/// circular buffer which all input data passes through) will perform more poorly using
/// [`ChunkIncr`] than with non-incremental interfaces.
/// Returned by [`ChunkIncr::iter_slices_strict()`]
///
/// Always emits _complete_ slices during iteration.
/// Returned by [`ChunkIncr::iter_slices()`]
///
/// When it runs out of data, it returns the remainder as the last element of the iteration
/// Implemented by algorithms that define methods of chunking data
///
/// This is the lowest level (but somewhat restrictive) trait for chunking algorithms. It assumes
/// that the input is provided to it in a contiguous slice. If you don't have your input as a
/// contiguous slice, [`ChunkIncr`] may be a better choice (it allows non-contiguous input, but may
/// be slower for some chunking algorithms).
/// Implemented on types which can be converted to / can provide a [`ChunkIncr`] interface.
///
/// Types that implement this generally represent an instantiation of a chunking algorithm.
// NOTE: we use this instead of just having `From<&C: Chunk> for CI: ChunkIncr` because there is
// _one_ `ChunkIncr` for each `Chunk`, and Rust can't infer that when using a `From` or `Into`
// bound.
//
// We could consider adding `type Incr` into `trait Chunk`, or only having `type Incr`