//! hash-roll provides various content defined chunking algorithms
//!
//! Content defined chunking (CDC) algorithms are algorithms that examine a stream of input bytes
//! (often represented as a slice like `[u8]`) and provide locations within that input stream at
//! which to split or chunk the stream into parts.
//!
//! CDC algorithms generally try to optimize for the following:
//!
//!  1. Processing speed (ie: bytes/second)
//!  2. Stability in split locations even when insertions/deletions of bytes occur
//!  3. Reasonable distributions of chunk lengths
//!
//! ## API Concepts
//!
//!  - Configured Algorithm Instance (implements [`Chunk`]). Normally named plainly (like
//!    [`Bup`]). These can be thought of as "parameters" for an algorithm.
//!  - Incremental (implements [`ChunkIncr`]). Normally named with an `Incr` suffix.
//!
//! Because of the various ways one might use a CDC, and the different CDC algorithm
//! characteristics, hash-roll provides a few ways to use them.
//!
//! Configured Algorithm Instances are created from the set of configuration needed for a given
//! algorithm. For example, this might mean configuring a window size or how to decide where to
//! split. These don't include any mutable data; in other words, they don't keep track of what
//! data is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well as
//! methods to obtain other kinds of APIs, like incremental style APIs.
//!
//! ## CDC Algorithms and Window Buffering
//!
//! Different CDC algorithms have different constraints about how they process data. Notably, some
//! require a large amount of previously processed data to process additional data. This "large
//! amount of previously processed data" is typically referred to as the "window". That said, note
//! that some CDC algorithms that use a window concept don't need previously accessed data.
//!
//! For the window-buffering algorithms, there is an extra cost to certain types of API
//! implementations.
//! The documentation will note when these occur and suggest alternatives.
//!
//! Generally, CDC interfaces that are incremental will be slower for window-buffering algorithms.
//! Using an explicitly allocating interface (which emits `Vec<u8>` or `Vec<Vec<u8>>`) will have no
//! worse performance than the incremental API, but might be more convenient. Using an all-at-once
//! API will provide the best performance due to not requiring any buffering (the input data can
//! be used directly).
//!
//! ## Use Cases that drive API choices
//!
//!  - accumulate vecs, emit vecs
//!    - incremental: yes
//!    - input: `Vec<u8>`
//!    - internal state: `Vec<Vec<u8>>`
//!    - output: `Vec<Vec<u8>>`
//!
//!  - stream data through
//!    - incremental: yes
//!    - input: `&[u8]`
//!
//!  - mmap (or read entire) file, emit
//!    - incremental: no
//!    - input: `&[u8]`
//!    - output: `&[u8]`

// # API Design Notes
//
// ## What methods should be in a trait? What should be in wrapper structs?
//
//  - Place methods that might have more optimized variants, but can have common implementations,
//    in a trait. This notably affects window-buffering differences: it's always possible to
//    implement all-at-once processing using incremental interfaces that internally buffer, but
//    it's much more efficient for window-buffering algorithms to provide implementations that
//    know how to look into the input data directly.

#![warn(rust_2018_idioms, missing_debug_implementations)]

/* TODO: Rabin-Karp
 * H = c_1 * a ** (k-1) + c_2 * a ** (k-2) ... + c_k * a ** 0
 * where:
 *   a is a constant
 *   c_1, ..., c_k are the input characters
 *
 * All math is done modulo n.
 * Choice of n & a is critical.
 *
 * Parameters:
 *  - n: modulo limit
 *  - a: a constant
 *
 * State:
 *  H
 *
 * Application:
 */

/* TODO:
 * rollsum of librsync
 */

use std::mem;

pub mod bup;
pub mod buzhash;
pub mod buzhash_table;
pub mod fastcdc;
pub mod gear;
pub mod gear_table;
pub mod gzip;
pub mod mii;
pub mod pigz;
pub mod ram;
pub mod range;
pub mod zpaq;
pub mod zstd;

pub(crate) use range::RangeExt;

/// Accept incremental input and provide indexes of split points
///
/// Compared to [`Chunk`], [`ChunkIncr`] allows avoiding having to buffer all input data in
/// memory, and avoids the need to use a single buffer for storing the input data (even if all
/// data is in memory).
///
/// Data fed into a given [`ChunkIncr`] instance is considered to be part of the same data
/// "source". This affects chunking algorithms that maintain some state between chunks (like
/// `ZstdRsyncable` does). If you have multiple "sources", one should obtain new instances of
/// [`ChunkIncr`] for each of them (typically via [`ToChunkIncr`]).
///
/// Note that for some splitting/chunking algorithms, the incremental API will be less efficient
/// compared to the non-incremental API. In particular, algorithms like [`Rsyncable`] that require
/// the use of previously examined data to shift their "window" (resulting in needing a circular
/// buffer which all input data passes through) will perform more poorly using [`ChunkIncr`]
/// compared with non-incremental interfaces.
pub trait ChunkIncr {
    /// The data "contained" within an implementor of this trait is the history of all data slices
    /// passed to `push()`.
    ///
    /// In other words, all previous data (or no previous data) may be used in determining the
    /// point to split.
    ///
    /// Returns `None` if the data has no split point.
    /// Otherwise, returns an index in the most recently passed `data`.
    ///
    /// Note that returning the index in the current slice makes most "look-ahead" splitting
    /// impossible (as it is permissible to pass 1 byte at a time).
    fn push(&mut self, data: &[u8]) -> Option<usize>;

    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
    ///
    /// Will always return enough slices to form the entire content of `data`, even if the
    /// trailing part of data is not a chunk (ie: does not end on a chunk boundary).
    fn iter_slices<'a>(self, data: &'a [u8]) -> IterSlices<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlices {
            rem: data,
            chunker: self,
        }
    }

    /// Given a [`ChunkIncr`] and a single slice, return a list of slices chunked by the chunker.
    ///
    /// Does not return the remainder (if any) in the iteration. Use
    /// [`IterSlicesStrict::take_rem()`] or [`IterSlicesStrict::into_parts()`] to get the
    /// remainder.
    ///
    /// Note that this is a non-incremental interface. Calling this on an already fed chunker or
    /// using this multiple times on the same chunker may provide unexpected results.
    fn iter_slices_strict<'a>(self, data: &'a [u8]) -> IterSlicesStrict<'a, Self>
    where
        Self: std::marker::Sized,
    {
        IterSlicesStrict {
            rem: data,
            chunker: self,
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices_strict()`]
///
/// Always emits _complete_ slices during iteration.
#[derive(Debug)]
pub struct IterSlicesStrict<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlicesStrict<'a, C> {
    /// Take the remainder from this iterator. Leaves an empty slice in its place.
    pub fn take_rem(&mut self) -> &'a [u8] {
        let mut l: &[u8] = &[];
        mem::swap(&mut self.rem, &mut l);
        l
    }

    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlicesStrict<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        match self.chunker.push(self.rem) {
            None => None,
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Returned by [`ChunkIncr::iter_slices()`]
///
/// When it runs out of data, it returns the remainder as the last element of the iteration.
#[derive(Debug)]
pub struct IterSlices<'a, C: ChunkIncr> {
    rem: &'a [u8],
    chunker: C,
}

impl<'a, C: ChunkIncr> IterSlices<'a, C> {
    /// Obtain the internals
    ///
    /// Useful, for example, after iteration stops to obtain the remaining slice.
    pub fn into_parts(self) -> (C, &'a [u8]) {
        (self.chunker, self.rem)
    }
}

impl<'a, C: ChunkIncr> Iterator for IterSlices<'a, C> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        if self.rem.is_empty() {
            return None;
        }

        match self.chunker.push(self.rem) {
            None => {
                let v = self.rem;
                self.rem = &[];
                Some(v)
            }
            Some(l) => {
                let (v, rn) = self.rem.split_at(l);
                self.rem = rn;
                Some(v)
            }
        }
    }
}

/// Implemented on algorithms that define methods of chunking data
///
/// This is the lowest level (but somewhat restrictive) trait for chunking algorithms. It assumes
/// that the input is provided to it in a contiguous slice. If you don't have your input as a
/// contiguous slice, [`ChunkIncr`] may be a better choice (it allows non-contiguous input, but
/// may be slower for some chunking algorithms).
pub trait Chunk {
    /// `SearchState` allows searching for the chunk edge to resume without duplicating work
    /// already done.
    type SearchState;

    /*
    /// Amount of data from already emitted chunks required for determining future chunks
    ///
    /// Indicates the amount of data that _must_ be preserved for [`find_chunk_edge()`]'s
    /// `prev_data` argument. If more than this is passed, the last bytes in the slice are used.
    /// At the start of an input (where there is no previous data), an empty slice would be used.
    ///
    /// For most chunking algorithms, this is `0` (zero), indicating that `prev_data` may always
    /// be an empty slice.
    const CARRY_LEN: usize;
    */

    /// Provide an initial [`SearchState`] for use with [`find_chunk_edge()`]. Generally, for each
    /// input one should generate a new [`SearchState`].
    fn to_search_state(&self) -> Self::SearchState;

    /// Find the next "chunk" in `data` to emit
    ///
    /// The return value is a pair of an optional cut point (an index in `data` marking the end of
    /// the chunk being emitted) and the offset from which subsequent `data` subsets should be
    /// passed to the next call to `find_chunk_edge`.
    ///
    /// `state` is mutated so that it does not re-examine previously examined data, even when a
    /// chunk is not emitted.
    ///
    /// `data` may be extended with additional data between calls to `find_chunk_edge()`. The
    /// bytes that were _previously_ in `data` and are not indicated by `discard_ct` must be
    /// preserved in the next `data` buffer passed.
    ///
    /// ```rust
    /// use hash_roll::Chunk;
    ///
    /// fn some_chunk() -> impl Chunk {
    ///     hash_roll::mii::Mii::default()
    /// }
    ///
    /// let chunk = some_chunk();
    /// let orig_data = b"hello";
    /// let mut data = &orig_data[..];
    /// let mut ss = chunk.to_search_state();
    /// let mut prev_cut = 0;
    ///
    /// loop {
    ///     let (edge, discard_ct) = chunk.find_chunk_edge(&mut ss, data);
    ///
    ///     match edge {
    ///         Some(cut_point) => {
    ///             // map `cut_point` from the current slice back into the original slice so we
    ///             // can have consistent indexes
    ///             let g_cut = cut_point + orig_data.len() - data.len();
    ///             println!("chunk: {:?}", &orig_data[prev_cut..g_cut]);
    ///             prev_cut = g_cut;
    ///         },
    ///         None => {
    ///             println!("no chunk, done with data we have");
    ///             println!("remain: {:?}", &data[discard_ct..]);
    ///             break;
    ///         }
    ///     }
    ///
    ///     data = &data[discard_ct..];
    /// }
    /// ```
    ///
    /// Note: call additional times on the same `SearchState` with the required `data` to obtain
    /// subsequent chunks in the same input data. To handle a separate input, use a new
    /// `SearchState`.
    ///
    /// Note: calling with a previous `state` and a new `data` that isn't an extension of the
    /// previous `data` will result in split points that may not follow the design of the
    /// underlying algorithm. Avoid relying on consistent cut points to reason about memory
    /// safety.
    //
    // NOTE: the reason that we preserve `state` even when chunks are emitted is that some
    // algorithms require some state to pass between chunks for a given input. zstd includes an
    // example of an algorithm that needs this.
    //
    // Potential pitfall: for better performance, keeping the return value small is a very good
    // idea. By returning ~2x64+32, we might be less performant depending on the ABI selected.
    //
    // Consider if the result should return `(&[u8], &[u8])` instead of an index (which would
    // then be given to `.split_at()`).
    //
    // Consider if `state` should have a `reset()` method to avoid reallocating.
    //
    // API:
    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<(usize, usize)>, usize)`
    //    - Problem: unclear what indexes of slices represent: start can't be in the data being
    //      passed because we don't require `data` include the start of the chunk
    //  - `fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<usize>, usize)`
    //    - Problem: user code to track indexing match-up is somewhat difficult
    //      - mostly due to needing an extra index to track to handle the "last chunk" location
    //        not being the start of the "slice we need to pass"
    fn find_chunk_edge(&self, state: &mut Self::SearchState, data: &[u8]) -> (Option<usize>, usize);
}

/// Implemented on types which can be converted to/can provide a [`ChunkIncr`] interface.
///
/// Types that implement this generally represent an instantiation of a chunking algorithm.
// NOTE: we use this instead of just having `From<&C: Chunk> for CI: ChunkIncr` because there is
// _one_ `ChunkIncr` for each `Chunk`, and rust can't infer that when using a `From` or `Into`
// bound.
//
// We could consider adding `type Incr` into `trait Chunk`, or only having `type Incr`.
pub trait ToChunkIncr {
    /// `Incr` provides the incremental interface to this chunking instance.
    type Incr: ChunkIncr;

    /// `to_chunk_incr()` returns a [`ChunkIncr`] which can be incrementally fed data and emits
    /// chunks.
    ///
    /// Generally, this is typically a low cost operation that copies from the implementor or does
    /// minor computation on its fields, and may allocate some memory for storing additional state
    /// needed for incremental computation.
    fn to_chunk_incr(&self) -> Self::Incr;
}
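The `ChunkIncr::push` contract (an index into the most recently passed slice, or `None`) can be illustrated with a minimal, self-contained sketch. `FixedSize` below is a hypothetical toy chunker that splits every `n` bytes; real chunkers in this crate are content defined, and the trait is re-stated locally so the snippet runs on its own:

```rust
/// Local re-statement of the `ChunkIncr` push contract for illustration.
trait ToyChunkIncr {
    /// Returns the split index within the most recent `data`, if any.
    fn push(&mut self, data: &[u8]) -> Option<usize>;
}

/// Hypothetical toy chunker: emits a boundary every `n` bytes of total input.
struct FixedSize {
    n: usize,
    seen: usize,
}

impl ToyChunkIncr for FixedSize {
    fn push(&mut self, data: &[u8]) -> Option<usize> {
        // Bytes still needed before the next boundary, counting data
        // already fed via earlier `push` calls.
        let need = self.n - self.seen;
        if data.len() >= need {
            self.seen = 0;
            Some(need)
        } else {
            self.seen += data.len();
            None
        }
    }
}

fn main() {
    let mut c = FixedSize { n: 4, seen: 0 };
    let mut rem: &[u8] = b"abcdefghij";
    let mut chunks: Vec<&[u8]> = Vec::new();
    // Re-push the remainder until no further split point is found,
    // mirroring how `IterSlicesStrict` drives a chunker.
    while let Some(i) = c.push(rem) {
        let (head, tail) = rem.split_at(i);
        chunks.push(head);
        rem = tail;
    }
    // Trailing bytes that did not end on a boundary remain, much like
    // the remainder exposed by `IterSlicesStrict::take_rem()`.
    assert_eq!(chunks, vec![&b"abcd"[..], &b"efgh"[..]]);
    assert_eq!(rem, b"ij");
}
```

The caller re-feeds the tail after each split, which is why returning an index in the *current* slice (rather than look-ahead) keeps the driving loop simple.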
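The `find_chunk_edge` calling convention of `(Option<usize>, usize)` plus a resumable search state can likewise be sketched standalone. `LineChunk` below is a hypothetical splitter (cuts after each `b'\n'`), not a real chunker from this crate; its `SearchState` records how much of `data` was already scanned so an extended buffer is not re-examined:

```rust
/// Hypothetical toy splitter: a chunk ends after each b'\n'.
struct LineChunk;

/// Tracks how many bytes of the current `data` were already examined,
/// so a later call with extended `data` resumes instead of rescanning.
struct SearchState {
    scanned: usize,
}

impl LineChunk {
    fn to_search_state(&self) -> SearchState {
        SearchState { scanned: 0 }
    }

    /// Returns `(Some(cut_index), discard_ct)` when a chunk edge is found
    /// in `data`, or `(None, discard_ct)` when more data is needed.
    fn find_chunk_edge(&self, state: &mut SearchState, data: &[u8]) -> (Option<usize>, usize) {
        for i in state.scanned..data.len() {
            if data[i] == b'\n' {
                state.scanned = 0;
                return (Some(i + 1), i + 1);
            }
        }
        state.scanned = data.len();
        (None, 0)
    }
}

fn main() {
    let c = LineChunk;
    let mut ss = c.to_search_state();
    let mut data: &[u8] = b"one\ntwo\nthree";
    let mut chunks: Vec<&[u8]> = Vec::new();
    loop {
        let (cut, discard) = c.find_chunk_edge(&mut ss, data);
        match cut {
            Some(i) => chunks.push(&data[..i]),
            None => break,
        }
        // Drop the bytes the chunker told us it no longer needs.
        data = &data[discard..];
    }
    assert_eq!(chunks, vec![&b"one\n"[..], &b"two\n"[..]]);
    assert_eq!(data, b"three");
}
```

Note how `discard_ct` lets the caller decide how much of the buffer to retain; a real windowed algorithm could return a smaller discard count than the cut index to keep window bytes available.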