//! hash-roll provides various content defined chunking algorithms
//!
//! Content defined chunking (CDC) algorithms examine a stream of input bytes (often
//! represented as a slice like `[u8]`) and provide locations within that input stream at which
//! to split or chunk the stream into parts.
//!
//! CDC algorithms generally try to optimize for the following:
//!
//!  1. Processing speed (ie: bytes/second)
//!  2. Stability in split locations even when insertions/deletions of bytes occur
//!  3. Reasonable distributions of chunk lengths
//!
//! ## API Concepts
//!
//! - Configured Algorithm Instance (implements [`Chunk`]). Normally named plainly (like
//!   [`Bup`]). These can be thought of as "parameters" for an algorithm.
//! - Incremental (implements [`ChunkIncr`]). Normally named with an `Incr` suffix.
//!
//! Because of the various ways one might use a CDC, and the different CDC algorithm
//! characteristics, hash-roll provides a few ways to use them.
//!
//! Configured Algorithm Instances are created from the set of configuration needed for a given
//! algorithm. For example, this might mean configuring a window size or how to decide where to
//! split. These don't include any mutable data; in other words, they don't keep track of what
//! data is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well
//! as methods to obtain other kinds of APIs, like incremental-style APIs.
//!
//! ## CDC Algorithms and Window Buffering
//!
//! Different CDC algorithms have different constraints on how they process data. Notably, some
//! require a large amount of previously processed data to process additional data. This "large
//! amount of previously processed data" is typically referred to as the "window". That said, note
//! that some CDC algorithms that use a window concept don't need access to previously processed
//! data.
//!
//! For the window-buffering algorithms, there is an extra cost to certain types of API
//! implementations. The documentation will note when these occur and suggest alternatives.
//!
//! Generally, CDC interfaces that are incremental will be slower for window-buffering algorithms.
//! Using an explicitly allocating interface (which emits `Vec<u8>` or `Vec<Vec<u8>>`) will have no
//! worse performance than the incremental API, but might be more convenient. Using an all-at-once
//! API will provide the best performance because no buffering is required (the input data can be
//! used directly).
//!
//! ## Use Cases that drive API choices
//!
//!  - accumulate vecs, emit vecs
//!    - incremental: yes
//!    - input: `Vec<u8>`
//!    - internal state: `Vec<Vec<u8>>`
//!    - output: `Vec<Vec<u8>>`
//!
//!  - stream data through
//!    - incremental: yes
//!    - input: `&[u8]`
//!
//!  - mmap (or read entire) file, emit slices
//!    - incremental: no
//!    - input: `&[u8]`
//!    - output: `&[u8]`
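//!
//! The "stream data through" case can be sketched with a toy incremental chunker (a
//! fixed-length splitter standing in for a real CDC algorithm; real implementations pick split
//! points from the data content):
//!
//! ```rust
//! use hash_roll::ChunkIncr;
//!
//! /// Toy chunker: emit a split point every 4 bytes (not content defined).
//! struct Every4 { seen: usize }
//!
//! impl ChunkIncr for Every4 {
//!     fn push(&mut self, data: &[u8]) -> Option<usize> {
//!         for i in 0..data.len() {
//!             self.seen += 1;
//!             if self.seen == 4 {
//!                 self.seen = 0;
//!                 // index of the split point in the slice passed to this call
//!                 return Some(i + 1);
//!             }
//!         }
//!         None
//!     }
//! }
//!
//! let chunks: Vec<&[u8]> = Every4 { seen: 0 }.iter_slices(b"abcdefghij").collect();
//! assert_eq!(chunks, [&b"abcd"[..], &b"efgh"[..], &b"ij"[..]]);
//! ```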

// # API Design Notes
//
// ## What methods should be in a trait? What should be in wrapper structs?
//
//  - place methods that might have more optimized variants, but can have common implementations,
//    in a trait. This notably affects window-buffering differences: it's always possible to
//    implement all-at-once processing using incremental interfaces that internally buffer, but
//    it's much more efficient for window-buffering algorithms to provide implementations that know
//    how to look into the input data directly.

#![warn(rust_2018_idioms, missing_debug_implementations)]
/* TODO: Rabin-Karp
 * H = c_1 * a ** (k-1) + c_2 * a ** (k-2) ... + c_k * a ** 0
 * where:
 *  a is a constant
 *  c_1, ..., c_k are the input characters
 *
 * All math is done modulo n. Choice of n & a is critical
 *
 * Parameters:
 *  - n: modulo limit
 *  - a: a constant
 *
 * State:
 *  H
 *
 * Application:
 */
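/* A minimal standalone sketch of the rolling update described above (the values of `a`, `n`,
 * and `k` here are arbitrary illustrative choices, not crate API):
 *
 * ```rust
 * const A: u64 = 256;        // the constant `a`
 * const N: u64 = 1_000_003;  // the modulus `n`
 * const K: usize = 4;        // window length `k`
 *
 * // Full hash of a k-byte window: H = c_1*a^(k-1) + c_2*a^(k-2) + ... + c_k, mod n.
 * fn hash(window: &[u8]) -> u64 {
 *     window.iter().fold(0, |h, &c| (h * A + c as u64) % N)
 * }
 *
 * // Roll the hash one byte: drop `out` (the oldest byte), append `inp` (the newest).
 * // `a_k1` is a^(k-1) mod n, precomputed once.
 * fn roll(h: u64, out: u8, inp: u8, a_k1: u64) -> u64 {
 *     let h = (h + N - (out as u64 * a_k1) % N) % N; // remove `out`'s contribution
 *     (h * A + inp as u64) % N                       // shift and add `inp`
 * }
 *
 * let a_k1 = (0..K - 1).fold(1u64, |p, _| p * A % N);
 * assert_eq!(roll(hash(b"abcd"), b'a', b'e', a_k1), hash(b"bcde"));
 * ```
 */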

/* TODO:
 * rollsum of librsync
 */

use std::mem;

pub mod bup;
pub mod buzhash;
pub mod buzhash_table;
pub mod fastcdc;
pub mod gear;
pub mod gear_table;
pub mod gzip;
pub mod mii;
pub mod pigz;
pub mod ram;
pub mod range;
pub mod zpaq;
pub mod zstd;

pub(crate) use range::RangeExt;

/// Accept incremental input and provide indexes of split points
///
/// Compared to [`Chunk`], [`ChunkIncr`] avoids the need to buffer all input data in memory, and
/// the need to store the input in a single contiguous buffer (even when all the data is in
/// memory).
///
/// Data fed into a given [`ChunkIncr`] instance is considered to be part of the same
/// data "source". This affects chunking algorithms that maintain some state between chunks
/// (like `ZstdRsyncable` does). If you have multiple "sources", one should obtain new instances of
/// [`ChunkIncr`] for each of them (typically via [`ToChunkIncr`]).
///
/// Note that for some splitting/chunking algorithms, the incremental API will be less efficient
/// than the non-incremental API. In particular, algorithms like [`Rsyncable`] that require
/// the use of previously examined data to shift their "window" (resulting in needing a circular
/// buffer through which all input data passes) will perform more poorly using [`ChunkIncr`]
/// than with non-incremental interfaces.
pub trait ChunkIncr {
    /// The data "contained" within an implementor of this trait is the history of all data
    /// slices passed to `push()`.
    ///
    /// In other words, all previous data (or no previous data) may be used in determining the
    /// point to split.
    ///
    /// Returns None if the data has no split point.
    /// Otherwise, returns an index in the most recently passed `data`.
    ///
    /// Note that returning the index in the current slice makes most "look-ahead" splitting
    /// impossible (as it is permissible to pass 1 byte at a time).
    fn push(&mut self, data: &[u8]) -> Option<usize>;

    /// Given a [`ChunkIncr`] and a single slice, return an iterator over the slices chunked by
    /// the chunker.
    ///
    /// Always yields enough slices to cover the entire content of `data`, even if the trailing
    /// part of `data` is not a complete chunk (ie: does not end on a chunk boundary).
    fn iter_slices<'a>(self, data: &'a [u8]) -> IterSlices<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlices {
            rem: data,
            chunker: self,
        }
    }

    /// Given a [`ChunkIncr`] and a single slice, return an iterator over the complete slices
    /// chunked by the chunker. Does not return the remainder (if any) in the iteration. Use
    /// [`IterSlicesStrict::take_rem()`] or [`IterSlicesStrict::into_parts()`] to get the
    /// remainder.
    ///
    /// Note that this is a non-incremental interface. Calling this on an already-fed chunker or
    /// using this multiple times on the same chunker may provide unexpected results.
    fn iter_slices_strict<'a>(self, data: &'a [u8]) -> IterSlicesStrict<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlicesStrict {
            rem: data,
            chunker: self,
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices_strict()`]
///
/// Always emits _complete_ slices during iteration.
#[derive(Debug)]
pub struct IterSlicesStrict<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlicesStrict<'a, C> {
    /// Take the remainder from this iterator. Leaves an empty slice in its place.
    pub fn take_rem(&mut self) -> &'a [u8] {
        let mut l: &[u8] = &[];
        mem::swap(&mut self.rem, &mut l);
        l
    }

    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlicesStrict<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        match self.chunker.push(self.rem) {
            None => None,
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices()`]
///
/// When it runs out of data, it returns the remainder as the last element of the iteration
#[derive(Debug)]
pub struct IterSlices<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlices<'a, C> {
    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlices<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        if self.rem.is_empty() {
            return None;
        }

        match self.chunker.push(self.rem) {
            None => {
                let v = self.rem;
                self.rem = &[];
                Some(v)
            }
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Implemented on algorithms that define methods of chunking data
///
/// This is the lowest level (but somewhat restrictive) trait for chunking algorithms. It assumes
/// that the input is provided to it in a contiguous slice. If you don't have your input as a
/// contiguous slice, [`ChunkIncr`] may be a better choice (it allows non-contiguous input, but
/// may be slower for some chunking algorithms).
pub trait Chunk {
    /// `SearchState` allows searching for the chunk edge to resume without duplicating work
    /// already done.
    type SearchState;

    /*
    /// Amount of data from already emitted chunks required for determining future chunks
    ///
    /// Indicates the amount of data that _must_ be preserved for [`find_chunk_edge()`]'s
    /// `prev_data` argument. If more than this is passed, the last bytes in the slice are used. At
    /// the start of an input (where there is no previous data), an empty slice would be used.
    ///
    /// For most chunking algorithms, this is `0` (zero), indicating that `prev_data` may always be
    /// an empty slice.
    const CARRY_LEN: usize;
    */

    /// Provide an initial [`SearchState`] for use with [`find_chunk_edge()`]. Generally, for each
    /// input one should generate a new [`SearchState`].
    fn to_search_state(&self) -> Self::SearchState;

    /// Find the next "chunk" in `data` to emit
    ///
    /// The return value is a pair of a range representing the start and end of the chunk being
    /// emitted, and the offset from which subsequent `data` subsets should be passed to the next
    /// call to `find_chunk_edge`.
    ///
    /// `state` is mutated so that it does not re-examine previously examined data, even when a
    /// chunk is not emitted.
    ///
    /// `data` may be extended with additional data between calls to `find_chunk_edge()`. The
    /// bytes that were _previously_ in `data` and are not covered by `discard_ct` must be
    /// preserved in the `data` buffer passed to the next call.
    ///
    /// ```rust
    /// use hash_roll::Chunk;
    ///
    /// fn some_chunk() -> impl Chunk {
    ///     hash_roll::mii::Mii::default()
    /// }
    ///
    /// let chunk = some_chunk();
    /// let orig_data = b"hello";
    /// let mut data = &orig_data[..];
    /// let mut ss = chunk.to_search_state();
    /// let mut prev_cut = 0;
    ///
    /// loop {
    ///    let (chunk, discard_ct) = chunk.find_chunk_edge(&mut ss, data);
    ///
    ///    match chunk {
    ///        Some(cut_point) => {
    ///            // map `cut_point` from the current slice back into the original slice so we
    ///            // have consistent indexes
    ///            let g_cut = cut_point + orig_data.len() - data.len();
    ///            println!("chunk: {:?}", &orig_data[prev_cut..g_cut]);
    ///            prev_cut = g_cut;
    ///        },
    ///        None => {
    ///            println!("no chunk, done with the data we have");
    ///            println!("remain: {:?}", &data[discard_ct..]);
    ///            break;
    ///        }
    ///    }
    ///
    ///    data = &data[discard_ct..];
    /// }
    /// ```
    ///
    /// Note: call `find_chunk_edge()` additional times with the same `SearchState` and the
    /// required `data` to obtain subsequent chunks from the same input data. To handle a
    /// separate input, use a new `SearchState`.
    ///
    /// Note: calling with a previous `state` and a new `data` that isn't an extension of the
    /// previous `data` will result in split points that may not follow the design of the
    /// underlying algorithm. Avoid relying on consistent cut points to reason about memory safety.
    ///
    // NOTE: the reason that we preserve `state` even when chunks are emitted is that some
    // algorithms require some state to pass between chunks for a given input. zstd includes an
    // example of an algorithm that needs this
    //
    // Potential pitfall: for better performance, keeping the return value small is a very good
    // idea. By returning ~2x64+32 bits, we might be less performant depending on the ABI selected.
    //
    // Consider if the result should return `(&[u8], &[u8])` instead of an index (which would
    // then be given to `.split_at()`)
    //
    // Consider if `state` should have a `reset()` method to avoid reallocating
    //
    // API:
    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<(usize, usize)>, usize);
    //     - Problem: unclear what indexes of slices represent: start can't be in the data being
    //       passed because we don't require `data` include the start of the chunk
    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<usize>, usize);
    //     - Problem: keeping the indexes matched up in user code is somewhat difficult
    //     - mostly due to needing to track an extra index to handle the "last chunk" location
    //       not being the start of the slice we need to pass
    fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8])
        -> (Option<usize>, usize);
}

/// Implemented on types which can be converted to/can provide a [`ChunkIncr`] interface.
///
/// Types that implement this generally represent an instantiation of a chunking algorithm.
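///
/// A sketch with toy types (hypothetical, standing in for a real algorithm): the configured
/// instance is cheap and immutable, and each input "source" gets a fresh [`ChunkIncr`] so
/// state never crosses sources:
///
/// ```rust
/// use hash_roll::{ChunkIncr, ToChunkIncr};
///
/// /// Toy "configured instance": split every `len` bytes (not content defined).
/// struct FixedLen { len: usize }
///
/// /// The matching incremental state.
/// struct FixedLenIncr { len: usize, seen: usize }
///
/// impl ChunkIncr for FixedLenIncr {
///     fn push(&mut self, data: &[u8]) -> Option<usize> {
///         for i in 0..data.len() {
///             self.seen += 1;
///             if self.seen == self.len {
///                 self.seen = 0;
///                 return Some(i + 1);
///             }
///         }
///         None
///     }
/// }
///
/// impl ToChunkIncr for FixedLen {
///     type Incr = FixedLenIncr;
///     fn to_chunk_incr(&self) -> FixedLenIncr {
///         // fresh mutable state per input "source"
///         FixedLenIncr { len: self.len, seen: 0 }
///     }
/// }
///
/// let params = FixedLen { len: 3 };
/// for source in [&b"abcdef"[..], &b"ghi"[..]] {
///     let mut incr = params.to_chunk_incr();
///     assert_eq!(incr.push(source), Some(3));
/// }
/// ```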
// NOTE: we use this instead of just having `From<&C: Chunk> for CI: ChunkIncr` because there is
// _one_ `ChunkIncr` for each `Chunk`, and rust can't infer that when using a `From` or `Into`
// bound.
//
// We could consider adding `type Incr` into `trait Chunk`, or only having `type Incr`
pub trait ToChunkIncr {
    /// `Incr` provides the incremental interface to this chunking instance
    type Incr: ChunkIncr;

    /// `to_chunk_incr()` returns a [`ChunkIncr`] which can be incrementally fed data and emits
    /// chunks.
    ///
    /// Generally, this is a low-cost operation that copies from the implementor or does minor
    /// computation on its fields, and may allocate some memory for storing additional state
    /// needed for incremental computation.
    fn to_chunk_incr(&self) -> Self::Incr;
}