Expand description
§MinCDC
MinCDC is a very simple yet efficient content-defined chunking algorithm. This library contains a SIMD-accelerated implementation of it.
The basic idea of MinCDC is to choose chunk boundaries based on the minimum
value of a sliding window over the input data. That is, if the desired
chunk size is between min_size and max_size, we find some
min_size <= i <= max_size such that evaluate(bytes[i - w..i]) is
minimized, where w is the window size, breaking ties by choosing the
earliest such i. Then we return chunk bytes[..i] and repeat the process
on the remainder bytes[i..].
This crate provides two implementations of MinCDC, both with a window size of 4:
MinCdc4, where the evaluation function isu32::from_le_bytes(bytes[i - 4..i]), i.e. a window size of 4 bytes interpreting the bytes as a little-endianu32, andMinCdcHash4, where the evaluation function ishash(u32::from_le_bytes(bytes[i - 4..i])). The hash function used is the very simplehash(x) = x.wrapping_mul(a).wrapping_add(b), for some constantsaandb.
MinCdcHash4 can be slightly (~10%) slower but is far more robust and
predictable, it is the recommended default.
§Usage
This library provides two chunkers:
SliceChunkerfor chunking a byte slice, andReadChunkerfor chunking a reader implementingRead.
Both chunkers take a desired minimum and maximum chunk size as well as a
Cdc instance (either MinCdc4 or
MinCdcHash4). Then by iterating over the chunker
(or calling next() in the case of ReadChunker) you get chunks of type
Chunk, which derefs to a byte slice, but also contains the offset of
that chunk in the input stream.
§Examples
let data = b"Hello, world! This is an example of MinCDC chunking.";
// Chunks between 8 and 16 bytes, using MinCdcHash4.
let mut chunker = SliceChunker::new(data, 8, 16, MinCdcHash4::new());
assert_eq!(chunker.next().as_deref(), Some(&b"Hello, world"[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"! This is "[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"an example "[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"of MinCDC chu"[..]));
// Last chunk may be smaller than min_size.
assert_eq!(chunker.next().as_deref(), Some(&b"nking."[..]));
assert!(chunker.next().is_none());Structs§
- Chunk
- A chunk returned by
SliceChunkerorReadChunker. - MinCdc4
- An instance of MinCDC4.
- MinCdc
Hash4 - An instance of MinCDCHash4.
- Read
Chunker - A chunker for a reader implementing
Read. - Slice
Chunker - A chunker for a byte slice.
Traits§
- Cdc
- A trait for determining splitpoints in a content-defined way.