hash_roll/
lib.rs

1//! hash-roll provides various content defined chunking algorithms
2//!
3//! Content defined chunking (CDC) algorithms are algorithms that examine a stream of input bytes (often
4//! represented as a slice like `[u8]`, and provide locations within that input stream to split or
5//! chunk the stream into parts.
6//!
7//! CDC algorithms generally try to optimize for the following:
8//!
9//!  1. Processing speed (ie: bytes/second)
10//!  2. Stability in split locations even when insertions/deletions of bytes occur
11//!  3. Reasonable distributions of chunk lengths
12//!
13//! ## API Concepts
14//!
15//! - Configured Algorithm Instance (impliments [`Chunk`]). Normally named plainly (like
16//!   [`Bup`]). These can be thought of as "parameters" for an algorithm.
17//! - Incrimental (impliments [`ChunkIncr`]). Normally named with `Incr` suffix.
18//!
19//! Because of the various ways one might use a CDC, and the different CDC algorithm
20//! characteristics, hash-roll provides a few ways to use them.
21//!
22//! Configured Algorithm Instances are created from the set of configuration needed for a given
23//! algorithm. For example, this might mean configuring a window size or how to decide where to
24//! split. These don't include any mutable data, in other words: they don't keep track of what data
25//! is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well as
26//! methods to obtain other kinds of APIs, like incrimental style apis.
27//!
28//! ## CDC Algorithms and Window Buffering
29//!
30//! Different CDC algorithms have different constraints about how they process data. Notably, some
31//! require a large amount of previous processed data to process additional data. This "large
32//! amount of previously processed data" is typically referred to as the "window". That said, note
33//! that some CDC algorithms that use a window concept don't need previously accessed data.
34//!
35//! For the window-buffering algorithms, their is an extra cost to certain types of API
36//! implimentations. The documentation will note when these occur and suggest alternatives.
37//!
38//! Generally, CDC interfaces that are incrimental will be slower for window-buffering algorithms.
39//! Using an explicitly allocating interface (which emits `Vec<u8>` or `Vec<Vec<u8>>`) will have no
40//! worse performance that the incrimental API, but might be more convenient. Using an all-at-once
41//! API will provide the best performance due to not requiring any buffering (the input data can be
42//! used directly).
43//!
44//! ## Use Cases that drive API choices
45//!
46//!  - accumulate vecs, emits vecs
47//!    - incrimental: yes
48//!    - input: `Vec<u8>`
49//!    - internal state: `Vec<Vec<u8>>`
50//!    - output: `Vec<Vec<u8>>`
51//!
52//!  - stream data through
53//!    - incrimenal: yes
54//!    - input: `&[u8]`
55//!
56//!  - mmap (or read entire) file, emit
57//!    - incrimenal: no
58//!    - input: `&[u8]`
59//!    - output: `&[u8]`
60
61// # API Design Notes
62//
63// ## What methods should be in a trait? What should be in wrapper structs?
64//
65//  - place methods that might have more optimized variants, but can have common implimentations,
66//    in a trait. This notably affects window-buffering differences: it's always possible to
67//    impliment all-at-once processing using incrimental interfaces that internally buffer, but
68//    it's much more efficient for window-buffering algorithms to provide implimentations that know
69//    how to look into the input data directly.
70
71#![warn(rust_2018_idioms, missing_debug_implementations)]
72/* TODO: Rabin-Karp
73 * H = c_1 * a ** (k-1) + c_2 * a ** (k-2) ... + c_k * a ** 0
74 * where:
75 *  a is a constant
76 *  c_1, ..., c_k are the input characters
77 *
78 * All math is done modulo n. Choice of n & a critical
79 *
80 * Parameters:
81 *  - n: mululo limit
82 *  - a: a constant
83 *
84 * State:
85 *  H
86 *
87 * Application:
88 */
89
90/* TODO:
91 * rollsum of librsync
92 */
93
94use std::mem;
95
96pub mod bup;
97pub mod buzhash;
98pub mod buzhash_table;
99pub mod fastcdc;
100pub mod gear;
101pub mod gear_table;
102pub mod gzip;
103pub mod mii;
104pub mod pigz;
105pub mod ram;
106pub mod range;
107pub mod zpaq;
108pub mod zstd;
109
110pub(crate) use range::RangeExt;
111
112/// Accept incrimental input and provide indexes of split points
113///
114/// Compared to [`Chunk`], [`ChunkIncr`] allows avoiding having to buffer all input data in memory,
115/// and avoids the need to use a single buffer for storing the input data (even if all data is in
116/// memory).
117///
118/// Data fed into a given [`ChunkIncr`] instance is considered to be part of the same
119/// data "source". This affects chunking algorithms that maintain some state between chunks
120/// (like `ZstdRsyncable` does). If you have multiple "sources", one should obtain new instances of
121/// [`ChunkIncr`] for each of them (typically via [`ToChunkIncr`]).
122///
123/// Note that for some splitting/chunking algorithms, the incrimental api will be less efficient
124/// compared to the non-incrimental API. In particular, algorithms like [`Rsyncable`] that require
125/// the use of previously examined data to shift their "window" (resulting in needing a circular
126/// buffer which all inputed data passes through) will perform more poorly using [`ChunkIncr`]
127/// compared with non-incrimental interfaces
128pub trait ChunkIncr {
129    /// The data "contained" within a implimentor of this trait is the history of all data slices
130    /// passed to feed.
131    ///
132    /// In other words, all previous data (or no previous data) may be used in determining the
133    /// point to split.
134    ///
135    /// Returns None if the data has no split point.
136    /// Otherwise, returns an index in the most recently passed `data`.
137    ///
138    /// Note that returning the index in the current slice makes most "look-ahead" splitting
139    /// impossible (as it is permissible to pass 1 byte at a time).
140    fn push(&mut self, data: &[u8]) -> Option<usize>;
141
142
143    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
144    ///
145    /// Will always return enough slices to form the entire content of `data`, even if the trailing
146    /// part of data is not a chunk (ie: does not end on a chunk boundary)
147    fn iter_slices<'a>(self, data: &'a [u8]) -> IterSlices<'a, Self>
148    where
149        Self: std::marker::Sized,
150    {
151        IterSlices {
152            rem: data,
153            chunker: self,
154        }
155    }
156
157    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
158    /// Does not return the remainder (if any) in the iteration. Use [`IterSlices::take_rem()`] or
159    /// [`IterSlices::into_parts()`] to get the remainder.
160    ///
161    /// Note that this is a non-incrimental interface. Calling this on an already fed chunker or using
162    /// this multiple times on the same chunker may provide unexpected results
163    fn iter_slices_strict<'a>(self, data: &'a [u8]) -> IterSlicesStrict<'a, Self>
164    where
165        Self: std::marker::Sized,
166    {
167        IterSlicesStrict {
168            rem: data,
169            chunker: self,
170        }
171    }
172}
173
174/// Returned by [`ChunkIncr::iter_slices_strict()`]
175///
176/// Always emits _complete_ slices durring iteration.
177#[derive(Debug)]
178pub struct IterSlicesStrict<'a, C: ChunkIncr> {
179    rem: &'a [u8],
180    chunker: C,
181}
182
183impl<'a, C: ChunkIncr> IterSlicesStrict<'a, C> {
184    /// Take the remainder from this iterator. Leaves an empty slice in it's place.
185    pub fn take_rem(&mut self) -> &'a [u8] {
186        let mut l: &[u8] = &[];
187        mem::swap(&mut self.rem, &mut l);
188        l
189    }
190
191    /// Obtain the internals
192    ///
193    /// Useful, for example, after iteration stops to obtain the remaining slice.
194    pub fn into_parts(self) -> (C, &'a [u8]) {
195        (self.chunker, self.rem)
196    }
197}
198
199impl<'a, C: ChunkIncr> Iterator for IterSlicesStrict<'a, C> {
200    type Item = &'a [u8];
201
202    fn next(&mut self) -> Option<Self::Item> {
203        match self.chunker.push(self.rem) {
204            None => None,
205            Some(l) => {
206                let (v, rn) = self.rem.split_at(l);
207                self.rem = rn;
208                Some(v)
209            }
210        }
211    }
212}
213
214/// Returned by [`ChunkIncr::iter_slices()`]
215///
216/// When it runs out of data, it returns the remainder as the last element of the iteration
217#[derive(Debug)]
218pub struct IterSlices<'a, C: ChunkIncr> {
219    rem: &'a [u8],
220    chunker: C,
221}
222
223impl<'a, C: ChunkIncr> IterSlices<'a, C> {
224    /// Obtain the internals
225    ///
226    /// Useful, for example, after iteration stops to obtain the remaining slice.
227    pub fn into_parts(self) -> (C, &'a [u8]) {
228        (self.chunker, self.rem)
229    }
230}
231
232impl<'a, C: ChunkIncr> Iterator for IterSlices<'a, C> {
233    type Item = &'a [u8];
234
235    fn next(&mut self) -> Option<Self::Item> {
236        if self.rem.is_empty() {
237            return None;
238        }
239
240        match self.chunker.push(self.rem) {
241            None => {
242                let v = self.rem;
243                self.rem = &[];
244                Some(v)
245            }
246            Some(l) => {
247                let (v, rn) = self.rem.split_at(l);
248                self.rem = rn;
249                Some(v)
250            }
251        }
252    }
253}
254
255/// Impl on algorthms that define methods of chunking data
256///
257/// This is the lowest level (but somewhat restrictive) trait for chunking algorthms.  It assumes
258/// that the input is provided to it in a contiguous slice. If you don't have your input as a
259/// contiguous slice, [`ChunkIncr`] may be a better choice (it allows non-contiguous input, but may
260/// be slowing for some chunking algorthms).
261pub trait Chunk {
262    /// `SearchState` allows searching for the chunk edge to resume without duplicating work
263    /// already done.
264    type SearchState;
265
266    /*
267    /// Amount of data from already emitted chunks requried for determining future chunks
268    ///
269    /// Indicates the amount of data that _must_ be preserved for [`find_chunk_edge()`]'s
270    /// `prev_data` argument. If more that this is passed, the last bytes in the slice are used. At
271    /// the start of an input (where there is no previous data), an empty slice would be used.
272    ///
273    /// For most chunking algorithms, this is `0` (zero), indicating that `prev_data` may always be
274    /// an empty slice.
275    const CARRY_LEN: usize;
276    */
277
278    /// Provide an initial [`SearchState`] for use with [`find_chunk_edge()`]. Generally, for each
279    /// input one should generate a new [`SearchState`].
280    fn to_search_state(&self) -> Self::SearchState;
281
282    /// Find the next "chunk" in `data` to emit
283    ///
284    /// The return value is a pair of a range representing the start and end of the chunk being
285    /// emitted, and the offset from which subsequent `data` subsets should be passed to the next
286    /// call to `find_chunk_edge`.
287    ///
288    /// `state` is mutated so that it does not rexamine previously examined data, even when a chunk
289    /// is not emitted.
290    ///
291    /// `data` may be extended with additional data between calls to `find_chunk_edge()`. The bytes
292    /// that were _previously_ in `data` and are not indicated by `discard_ct` must be preserved in
293    /// the next `data` buffer called.
294    ///
295    /// ```rust
296    /// use hash_roll::Chunk;
297    ///
298    /// fn some_chunk() -> impl Chunk {
299    ///     hash_roll::mii::Mii::default()
300    /// }
301    ///
302    /// let chunk = some_chunk();
303    /// let orig_data = b"hello";
304    /// let mut data = &orig_data[..];
305    /// let mut ss = chunk.to_search_state();
306    /// let mut prev_cut = 0;
307    ///
308    /// loop {
309    ///    let (chunk, discard_ct) = chunk.find_chunk_edge(&mut ss, data);
310    ///
311    ///    match chunk {
312    ///        Some(cut_point) => {
313    ///            // map `cut_point` from the current slice back into the original slice so we can
314    ///            // have consistent indexes
315    ///            let g_cut = cut_point + orig_data.len() - data.len();
316    ///            println!("chunk: {:?}", &orig_data[prev_cut..cut_point]);
317    ///        },
318    ///        None => {
319    ///            println!("no chunk, done with data we have");
320    ///            println!("remain: {:?}", &data[discard_ct..]);
321    ///            break;
322    ///        }
323    ///    }
324    ///
325    ///    data = &data[discard_ct..];
326    /// }
327    /// ```
328    ///
329    /// Note: call additional times on the same `SearchState` and the required `data` to obtain
330    /// subsequent chunks in the same input data. To handle a seperate input, use a new
331    /// `SearchState`.
332    ///
333    /// Note: calling with a previous `state` with a new `data` that isn't an extention of the
334    /// previous `data` will result in split points that may not follow the design of the
335    /// underlying algorithm. Avoid relying on consistent cut points to reason about memory safety.
336    ///
337    // NOTE: the reason that we preserve `state` even when chunks are emitted is that some
338    // algorthims require some state to pass between chunks for a given input. zstd includes an
339    // example of an algorithm that needs this
340    //
341    // Potential pitfal: for better performance, keeping the return value small is a very good
342    // idea. By returning ~2x64+32, we are might be less performant depending on the ABI selected.
343    //
344    // Consider if result should return `(&[u8], &[u8])` instead of an index (which would then be
345    // given to `.split_at()`
346    //
347    // Consider if `state` should have a `reset()` method to avoid reallocating
348    //
349    // API:
350    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<(usize, uszie)>, usize);
351    //     - Problem: unclear what indexes of slices represent: start can't be in the data being
352    //       passed because we don't require `data` include the start of the chunk
353    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<usize>, usize);
354    //     - Problem: user code to track indexing match up is somewhat difficult
355    //     - mostly due to needing an extra index to track to handle the "last chunk" location not
356    //     being the "slice we need to pass start"
357    fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8])
358        -> (Option<usize>, usize);
359}
360
361/// Implimented on types which can be converted to/can provide a [`ChunkIncr`] interface.
362///
363/// Types that impliment this generally represent a instantiation of a chunking algorithm.
364// NOTE: we use this instead of just having `From<&C: Chunk> for CI: ChunkIncr` because there is
365// _one_ `ChunkIncr` for each `Chunk`, and rust can't infer that when using a `From` or `Into`
366// bound.
367//
368// We could consider adding `type Incr` into `trait Chunk`, or only having `type Incr`
369pub trait ToChunkIncr {
370    /// `Incr` provides the incrimental interface to this chunking instance
371    type Incr: ChunkIncr;
372
373    /// `to_chunk_incr()` returns a [`ChunkIncr`] which can be incrimentally fed data and emits
374    /// chunks.
375    ///
376    /// Generally, this is a typically low cost operation that copies from the implimentor or does
377    /// minor computation on its fields and may allocate some memory for storing additional state
378    /// needed for incrimental computation.
379    fn to_chunk_incr(&self) -> Self::Incr;
380}