Crate mincdc

Crate mincdc 

Source
Expand description

§MinCDC

MinCDC is a very simple yet efficient content-defined chunking algorithm. This library contains a SIMD-accelerated implementation of it.

The basic idea of MinCDC is to choose chunk boundaries based on the minimum value of a sliding window over the input data. That is, if the desired chunk size is between min_size and max_size, we find some min_size <= i <= max_size such that evaluate(bytes[i - w..i]) is minimized, where w is the window size, breaking ties by choosing the earliest such i. Then we return chunk bytes[..i] and repeat the process on the remainder bytes[i..].

This crate provides two implementations of MinCDC, both with a window size of 4:

  • MinCdc4, where the evaluation function is u32::from_le_bytes(bytes[i - 4..i]), i.e. a window size of 4 bytes interpreting the bytes as a little-endian u32, and
  • MinCdcHash4, where the evaluation function is hash(u32::from_le_bytes(bytes[i - 4..i])). The hash function used is the very simple hash(x) = x.wrapping_mul(a).wrapping_add(b), for some constants a and b.

MinCdcHash4 can be slightly (~10%) slower but is far more robust and predictable, it is the recommended default.

§Usage

This library provides two chunkers:

Both chunkers take a desired minimum and maximum chunk size as well as a Cdc instance (either MinCdc4 or MinCdcHash4). Then by iterating over the chunker (or calling next() in the case of ReadChunker) you get chunks of type Chunk, which derefs to a byte slice, but also contains the offset of that chunk in the input stream.

§Examples

let data = b"Hello, world! This is an example of MinCDC chunking.";

// Chunks between 8 and 16 bytes, using MinCdcHash4.
let mut chunker = SliceChunker::new(data, 8, 16, MinCdcHash4::new());
assert_eq!(chunker.next().as_deref(), Some(&b"Hello, world"[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"! This is "[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"an example "[..]));
assert_eq!(chunker.next().as_deref(), Some(&b"of MinCDC chu"[..]));
// Last chunk may be smaller than min_size.
assert_eq!(chunker.next().as_deref(), Some(&b"nking."[..]));
assert!(chunker.next().is_none());

Structs§

Chunk
A chunk returned by SliceChunker or ReadChunker.
MinCdc4
An instance of MinCDC4.
MinCdcHash4
An instance of MinCDCHash4.
ReadChunker
A chunker for a reader implementing Read.
SliceChunker
A chunker for a byte slice.

Traits§

Cdc
A trait for determining splitpoints in a content-defined way.