hash_roll/lib.rs
//! hash-roll provides various content defined chunking algorithms
//!
//! Content defined chunking (CDC) algorithms examine a stream of input bytes (often represented
//! as a slice like `[u8]`) and provide locations within that input stream at which to split or
//! chunk the stream into parts.
//!
//! CDC algorithms generally try to optimize for the following:
//!
//! 1. Processing speed (i.e., bytes/second)
//! 2. Stability in split locations even when insertions/deletions of bytes occur
//! 3. Reasonable distributions of chunk lengths
//!
//! ## API Concepts
//!
//! - Configured Algorithm Instance (implements [`Chunk`]). Normally named plainly (like
//!   [`Bup`]). These can be thought of as "parameters" for an algorithm.
//! - Incremental (implements [`ChunkIncr`]). Normally named with an `Incr` suffix.
//!
//! Because of the various ways one might use a CDC, and the different CDC algorithm
//! characteristics, hash-roll provides a few ways to use them.
//!
//! Configured Algorithm Instances are created from the set of configuration needed for a given
//! algorithm. For example, this might mean configuring a window size or how to decide where to
//! split. These don't include any mutable data; in other words, they don't keep track of what
//! data is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well as
//! methods to obtain other kinds of APIs, like incremental-style APIs.
//!
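//! For example, a rough sketch of the two forms (assuming [`mii::Mii`], like the other provided
//! algorithms, also implements [`ToChunkIncr`]):
//!
//! ```ignore
//! use hash_roll::{ChunkIncr, ToChunkIncr};
//!
//! // Configured Algorithm Instance: parameters only, no record of the input data.
//! let params = hash_roll::mii::Mii::default();
//!
//! // Incremental form: owns whatever state is needed to track the data fed to it.
//! let mut incr = params.to_chunk_incr();
//! let maybe_split = incr.push(b"some input bytes");
//! ```
//!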
//! ## CDC Algorithms and Window Buffering
//!
//! Different CDC algorithms have different constraints about how they process data. Notably, some
//! require a large amount of previously processed data to process additional data. This "large
//! amount of previously processed data" is typically referred to as the "window". That said, note
//! that some CDC algorithms that use a window concept don't need access to previously processed
//! data.
//!
//! For the window-buffering algorithms, there is an extra cost to certain types of API
//! implementations. The documentation will note when these occur and suggest alternatives.
//!
//! Generally, CDC interfaces that are incremental will be slower for window-buffering algorithms.
//! Using an explicitly allocating interface (which emits `Vec<u8>` or `Vec<Vec<u8>>`) will have no
//! worse performance than the incremental API, but might be more convenient. Using an all-at-once
//! API will provide the best performance because no buffering is required (the input data can be
//! used directly).
//!
//! ## Use Cases that drive API choices
//!
//! - accumulate vecs, emit vecs
//!   - incremental: yes
//!   - input: `Vec<u8>`
//!   - internal state: `Vec<Vec<u8>>`
//!   - output: `Vec<Vec<u8>>`
//!
//! - stream data through (see the sketch after this list)
//!   - incremental: yes
//!   - input: `&[u8]`
//!
//! - mmap (or read entire) file, emit
//!   - incremental: no
//!   - input: `&[u8]`
//!   - output: `&[u8]`
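//!
//! As a sketch of the "stream data through" case, the hypothetical helper below feeds successive
//! reads into a [`ChunkIncr`] and maps each returned (slice-relative) index back to a global byte
//! offset. It only illustrates the bookkeeping; it is not an API provided by this crate:
//!
//! ```rust
//! use std::io::Read;
//! use hash_roll::ChunkIncr;
//!
//! fn split_offsets<R: Read, C: ChunkIncr>(mut input: R, mut chunker: C)
//!     -> std::io::Result<Vec<u64>>
//! {
//!     let mut offsets = Vec::new();
//!     let mut consumed = 0u64; // bytes already consumed by the chunker
//!     let mut buf = [0u8; 8192];
//!     loop {
//!         let n = input.read(&mut buf)?;
//!         if n == 0 {
//!             return Ok(offsets);
//!         }
//!         let mut data = &buf[..n];
//!         // `push` returns an index into the most recently passed slice, so translate it
//!         // into a global offset and re-feed the rest of this read.
//!         while !data.is_empty() {
//!             match chunker.push(data) {
//!                 Some(idx) => {
//!                     offsets.push(consumed + idx as u64);
//!                     consumed += idx as u64;
//!                     data = &data[idx..];
//!                 }
//!                 None => break,
//!             }
//!         }
//!         consumed += data.len() as u64;
//!     }
//! }
//! ```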

// # API Design Notes
//
// ## What methods should be in a trait? What should be in wrapper structs?
//
// - place methods that might have more optimized variants, but can have common implementations,
//   in a trait. This notably affects window-buffering differences: it's always possible to
//   implement all-at-once processing using incremental interfaces that internally buffer, but
//   it's much more efficient for window-buffering algorithms to provide implementations that know
//   how to look into the input data directly.

#![warn(rust_2018_idioms, missing_debug_implementations)]
/* TODO: Rabin-Karp
 * H = c_1 * a ** (k-1) + c_2 * a ** (k-2) ... + c_k * a ** 0
 * where:
 * a is a constant
 * c_1, ..., c_k are the input characters
 *
 * All math is done modulo n. Choice of n & a critical
 *
 * Parameters:
 * - n: modulo limit
 * - a: a constant
 *
 * State:
 * H
 *
 * Application:
 */

/* TODO:
 * rollsum of librsync
 */

use std::mem;

pub mod bup;
pub mod buzhash;
pub mod buzhash_table;
pub mod fastcdc;
pub mod gear;
pub mod gear_table;
pub mod gzip;
pub mod mii;
pub mod pigz;
pub mod ram;
pub mod range;
pub mod zpaq;
pub mod zstd;

pub(crate) use range::RangeExt;

/// Accept incremental input and provide indexes of split points
///
/// Compared to [`Chunk`], [`ChunkIncr`] allows avoiding having to buffer all input data in memory,
/// and avoids the need to use a single buffer for storing the input data (even if all data is in
/// memory).
///
/// Data fed into a given [`ChunkIncr`] instance is considered to be part of the same
/// data "source". This affects chunking algorithms that maintain some state between chunks
/// (like `ZstdRsyncable` does). If you have multiple "sources", one should obtain new instances of
/// [`ChunkIncr`] for each of them (typically via [`ToChunkIncr`]).
///
/// Note that for some splitting/chunking algorithms, the incremental API will be less efficient
/// than the non-incremental API. In particular, algorithms like [`Rsyncable`] that require
/// the use of previously examined data to shift their "window" (resulting in needing a circular
/// buffer which all input data passes through) will perform more poorly using [`ChunkIncr`]
/// than with non-incremental interfaces.
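///
/// A minimal sketch of the "multiple sources" rule, using a hypothetical `split_both` helper
/// (each source gets its own incremental chunker so no state carries over between them):
///
/// ```rust
/// use hash_roll::{ChunkIncr, ToChunkIncr};
///
/// fn split_both<P: ToChunkIncr>(params: &P, source_a: &[u8], source_b: &[u8]) {
///     // One incremental instance per input source.
///     let mut incr_a = params.to_chunk_incr();
///     let mut incr_b = params.to_chunk_incr();
///     let _ = incr_a.push(source_a);
///     let _ = incr_b.push(source_b);
/// }
/// ```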
pub trait ChunkIncr {
    /// The data "contained" within an implementor of this trait is the history of all data slices
    /// passed to `push()`.
    ///
    /// In other words, all previous data (or no previous data) may be used in determining the
    /// point to split.
    ///
    /// Returns `None` if the data has no split point.
    /// Otherwise, returns an index in the most recently passed `data`.
    ///
    /// Note that returning the index in the current slice makes most "look-ahead" splitting
    /// impossible (as it is permissible to pass 1 byte at a time).
    fn push(&mut self, data: &[u8]) -> Option<usize>;

    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
    ///
    /// Will always return enough slices to form the entire content of `data`, even if the trailing
    /// part of `data` is not a chunk (i.e., does not end on a chunk boundary).
    fn iter_slices<'a>(self, data: &'a [u8]) -> IterSlices<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlices {
            rem: data,
            chunker: self,
        }
    }

    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
    /// Does not return the remainder (if any) in the iteration. Use [`IterSlicesStrict::take_rem()`] or
    /// [`IterSlicesStrict::into_parts()`] to get the remainder.
    ///
    /// Note that this is a non-incremental interface. Calling this on an already-fed chunker or
    /// using this multiple times on the same chunker may provide unexpected results.
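    ///
    /// A sketch (using a hypothetical `complete_chunks_and_rem` helper) of recovering the
    /// trailing bytes that did not end on a chunk boundary:
    ///
    /// ```rust
    /// use hash_roll::ChunkIncr;
    ///
    /// fn complete_chunks_and_rem<'a, C: ChunkIncr>(
    ///     chunker: C,
    ///     data: &'a [u8],
    /// ) -> (Vec<&'a [u8]>, &'a [u8]) {
    ///     let mut iter = chunker.iter_slices_strict(data);
    ///     // The iterator yields only complete chunks...
    ///     let chunks: Vec<&[u8]> = iter.by_ref().collect();
    ///     // ...so the remainder has to be pulled out of it afterwards.
    ///     let (_chunker, rem) = iter.into_parts();
    ///     (chunks, rem)
    /// }
    /// ```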
    fn iter_slices_strict<'a>(self, data: &'a [u8]) -> IterSlicesStrict<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlicesStrict {
            rem: data,
            chunker: self,
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices_strict()`]
///
/// Always emits _complete_ slices during iteration.
#[derive(Debug)]
pub struct IterSlicesStrict<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlicesStrict<'a, C> {
    /// Take the remainder from this iterator. Leaves an empty slice in its place.
    pub fn take_rem(&mut self) -> &'a [u8] {
        let mut l: &[u8] = &[];
        mem::swap(&mut self.rem, &mut l);
        l
    }

    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlicesStrict<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        match self.chunker.push(self.rem) {
            None => None,
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices()`]
///
/// When it runs out of data, it returns the remainder as the last element of the iteration.
#[derive(Debug)]
pub struct IterSlices<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlices<'a, C> {
    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlices<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        if self.rem.is_empty() {
            return None;
        }

        match self.chunker.push(self.rem) {
            None => {
                let v = self.rem;
                self.rem = &[];
                Some(v)
            }
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Implemented by algorithms that define methods of chunking data
///
/// This is the lowest level (but somewhat restrictive) trait for chunking algorithms. It assumes
/// that the input is provided to it in a contiguous slice. If you don't have your input as a
/// contiguous slice, [`ChunkIncr`] may be a better choice (it allows non-contiguous input, but may
/// be slower for some chunking algorithms).
pub trait Chunk {
    /// `SearchState` allows searching for the chunk edge to resume without duplicating work
    /// already done.
    type SearchState;

    /*
    /// Amount of data from already emitted chunks required for determining future chunks
    ///
    /// Indicates the amount of data that _must_ be preserved for [`find_chunk_edge()`]'s
    /// `prev_data` argument. If more than this is passed, the last bytes in the slice are used. At
    /// the start of an input (where there is no previous data), an empty slice would be used.
    ///
    /// For most chunking algorithms, this is `0` (zero), indicating that `prev_data` may always be
    /// an empty slice.
    const CARRY_LEN: usize;
    */

    /// Provide an initial [`SearchState`] for use with [`find_chunk_edge()`]. Generally, for each
    /// input one should generate a new [`SearchState`].
    fn to_search_state(&self) -> Self::SearchState;

    /// Find the next "chunk" in `data` to emit
    ///
    /// The return value is a pair of an optional cut point (the end of the chunk being emitted,
    /// as an index into `data`) and the offset from which subsequent `data` subsets should be
    /// passed to the next call to `find_chunk_edge`.
    ///
    /// `state` is mutated so that it does not re-examine previously examined data, even when a
    /// chunk is not emitted.
    ///
    /// `data` may be extended with additional data between calls to `find_chunk_edge()`. The bytes
    /// that were _previously_ in `data` and are not indicated by `discard_ct` must be preserved in
    /// the `data` buffer passed to the next call.
    ///
    /// ```rust
    /// use hash_roll::Chunk;
    ///
    /// fn some_chunk() -> impl Chunk {
    ///     hash_roll::mii::Mii::default()
    /// }
    ///
    /// let chunk = some_chunk();
    /// let orig_data = b"hello";
    /// let mut data = &orig_data[..];
    /// let mut ss = chunk.to_search_state();
    /// let mut prev_cut = 0;
    ///
    /// loop {
    ///     let (edge, discard_ct) = chunk.find_chunk_edge(&mut ss, data);
    ///
    ///     match edge {
    ///         Some(cut_point) => {
    ///             // map `cut_point` from the current slice back into the original slice so we
    ///             // can have consistent indexes
    ///             let g_cut = cut_point + orig_data.len() - data.len();
    ///             println!("chunk: {:?}", &orig_data[prev_cut..g_cut]);
    ///             prev_cut = g_cut;
    ///         },
    ///         None => {
    ///             println!("no chunk, done with data we have");
    ///             println!("remain: {:?}", &data[discard_ct..]);
    ///             break;
    ///         }
    ///     }
    ///
    ///     data = &data[discard_ct..];
    /// }
    /// ```
    ///
    /// Note: call this additional times on the same `SearchState` with the required `data` to
    /// obtain subsequent chunks in the same input data. To handle a separate input, use a new
    /// `SearchState`.
    ///
    /// Note: calling with a previous `state` and a new `data` that isn't an extension of the
    /// previous `data` will result in split points that may not follow the design of the
    /// underlying algorithm. Avoid relying on consistent cut points to reason about memory safety.
    ///
    // NOTE: the reason that we preserve `state` even when chunks are emitted is that some
    // algorithms require some state to pass between chunks for a given input. zstd includes an
    // example of an algorithm that needs this.
    //
    // Potential pitfall: for better performance, keeping the return value small is a very good
    // idea. By returning ~2x64+32, we might be less performant depending on the ABI selected.
    //
    // Consider if the result should return `(&[u8], &[u8])` instead of an index (which would then
    // be given to `.split_at()`).
    //
    // Consider if `state` should have a `reset()` method to avoid reallocating.
    //
    // API:
    // - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<(usize, usize)>, usize);`
    //   - Problem: unclear what the slice indexes represent: the start can't be in the data being
    //     passed because we don't require that `data` include the start of the chunk
    // - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<usize>, usize);`
    //   - Problem: tracking how indexes match up in user code is somewhat difficult
    //     - mostly due to needing an extra index to track to handle the "last chunk" location not
    //       being the start of the "slice we need to pass"
    fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8])
        -> (Option<usize>, usize);
}

/// Implemented on types which can be converted to, or can provide, a [`ChunkIncr`] interface.
///
/// Types that implement this generally represent an instantiation of a chunking algorithm.
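///
/// For example, a sketch of a generic helper (hypothetical, not part of this crate) built on
/// this trait:
///
/// ```rust
/// use hash_roll::{ChunkIncr, ToChunkIncr};
///
/// // Split `data` using any configured chunking instance, including the trailing
/// // partial chunk (if any) as the last slice.
/// fn chunk_all<'a, P: ToChunkIncr>(params: &P, data: &'a [u8]) -> Vec<&'a [u8]> {
///     params.to_chunk_incr().iter_slices(data).collect()
/// }
/// ```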
// NOTE: we use this instead of just having `From<&C: Chunk> for CI: ChunkIncr` because there is
// _one_ `ChunkIncr` for each `Chunk`, and rust can't infer that when using a `From` or `Into`
// bound.
//
// We could consider adding `type Incr` into `trait Chunk`, or only having `type Incr`
pub trait ToChunkIncr {
    /// `Incr` provides the incremental interface to this chunking instance
    type Incr: ChunkIncr;

    /// `to_chunk_incr()` returns a [`ChunkIncr`] which can be incrementally fed data and emits
    /// chunks.
    ///
    /// Generally, this is typically a low-cost operation that copies from the implementor or does
    /// minor computation on its fields and may allocate some memory for storing additional state
    /// needed for incremental computation.
    fn to_chunk_incr(&self) -> Self::Incr;
}