Crate fastcdc_alt

FastCDC-Alt

This crate is a fork of the original Rust implementation by Nathan Fiedler. (nlfiedler/fastcdc-rs)
Included here is the enhanced 2020 “FastCDC” content defined chunking algorithm described by Wen Xia, et al.

This fork introduces an adjusted and somewhat more involved alternative API that increases flexibility and reduces memory overhead.
The cut points produced by this fork are identical to the ones produced by the original crate.

The README and all docs are adapted to the adjusted API.

To learn more about content defined chunking and its applications, see FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication from 2016, as well as the subsequent improvements described in the 2020 paper.

Examples

A short example of using the fast chunker is shown below:

use std::fs;
use fastcdc_alt::FastCDC;
let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let mut chunker = FastCDC::new(4096, 16384, 65535).unwrap();

for chunk in chunker.as_iterator(&contents) {
    println!("offset={} size={}", chunk.offset, chunk.get_length());
}

The example above is using normalized chunking level 1 as described in section 3.5 of the 2020 paper. To use a different level of chunking normalization, replace new with new_advanced as shown below:

use std::fs;
use fastcdc_alt::{FastCDC, Normalization};
let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let mut chunker = FastCDC::new_advanced(8192, 16384, 32768, Normalization::Level3, None).unwrap();
for chunk in chunker.as_iterator(&contents) {
    println!("offset={} size={}", chunk.offset, chunk.get_length());
}

Notice that the minimum and maximum chunk sizes were changed in the example using the maximum normalized chunking level. This is due to the behavior of normalized chunking in which the generated chunks tend to be closer to the expected chunk size. It is not necessary to change the min/max values, just something of which to be aware. With lower levels of normalized chunking, the size of the generated chunks will vary more. See the documentation of the Normalization enum for more detail as well as the FastCDC paper.

Minimum and Maximum

The values you choose for the minimum and maximum chunk sizes will depend on the input data to some extent, as well as the normalization level described above. Depending on your application, you may want to have a wide range of chunk sizes in order to improve the overall deduplication ratio.

Note that changing the minimum chunk size will almost certainly result in different cut points. It is best to pick a minimum chunk size for your application that can remain relevant indefinitely, lest you produce different sets of chunks for the same data.

Similarly, setting the maximum chunk size to be too small may result in cut points that were determined by the maximum size rather than the data itself. Ideally you want cut points that are determined by the input data. However, this is application dependent and your situation may be different.
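
To see the sensitivity to the minimum size in practice, a small sketch along these lines can compare the cut points produced with two different minimums. This reuses the fixture path from the examples above; the specific size values are otherwise arbitrary:

use std::fs;
use fastcdc_alt::FastCDC;

let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();

// Same average and maximum sizes, two different minimum sizes.
let mut chunker_a = FastCDC::new(2048, 16384, 65535).unwrap();
let mut chunker_b = FastCDC::new(8192, 16384, 65535).unwrap();

let offsets_a: Vec<_> = chunker_a.as_iterator(&contents).map(|chunk| chunk.offset).collect();
let offsets_b: Vec<_> = chunker_b.as_iterator(&contents).map(|chunk| chunk.offset).collect();

// The two configurations will generally disagree on at least some cut points.
println!("same cut points: {}", offsets_a == offsets_b);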

Large Data

If processing very large files, the streaming version of the chunker may be a suitable approach. It allocates a byte vector equal to the maximum chunk size, draining and resizing the vector as chunks are found. However, using a crate such as memmap2 can be significantly faster than the streaming chunker. See the examples in the examples directory for how to use the streaming versions as-is, versus the non-streaming chunkers which read from a memory-mapped file. Also consider directly leveraging the cut() method of the FastCDC struct to manually implement streaming functionality.
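
For the memory-mapped approach, a rough sketch might look like the following, assuming the memmap2 crate has been added as a dependency (error handling is elided with unwrap):

use std::fs::File;
use fastcdc_alt::FastCDC;
use memmap2::Mmap;

let file = File::open("test/fixtures/SekienAkashita.jpg").unwrap();
// Safety: the underlying file must not be modified while the mapping is in use.
let mmap = unsafe { Mmap::map(&file).unwrap() };

let mut chunker = FastCDC::new(4096, 16384, 65535).unwrap();
for chunk in chunker.as_iterator(&mmap[..]) {
    println!("offset={} size={}", chunk.offset, chunk.get_length());
}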

Modules

  • This module implements the canonical FastCDC algorithm as described in the paper by Wen Xia, et al., in 2020.

    The algorithm incorporates a simplified hash judgement using the fast Gear hash, sub-minimum chunk cut-point skipping, normalized chunking to produce chunks of a more consistent length, and “rolling two bytes each time”.

    According to the authors, this should be 30-40% faster than the 2016 version while producing the same cut points. Benchmarks on several large files on an Apple M1 show about a 20% improvement, but results may vary depending on CPU architecture, file size, chunk size, etc.

    There are two ways in which to use the FastCDC struct defined in this module. One is to tell the struct the expected content length via set_content_length() and then invoke cut() with buffers until all data is chunked. The other is to use the as_iterator() method to get a FastCDCIterator that yields Chunk structs.

    Note that the Chunk struct provides a 64-bit hash of the chunk, which may be useful in scenarios involving chunk size prediction using historical data, such as in RapidCDC or SuperCDC. While this value has rather low entropy, it is computationally cost-free and can be put to some use with additional record keeping.

    The StreamCDC implementation is similar to FastCDC except that it will read data from a Read source into an internal buffer of max_size and produce (Vec<u8>, Chunk) tuples from the Iterator, as sketched below.
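
    A minimal sketch of driving StreamCDC over a file is shown here. Note that the import path and constructor signature are assumptions modelled on FastCDC::new; only the (Vec<u8>, Chunk) item type follows from the description above, so consult the StreamCDC docs for the actual API.

    use std::fs::File;
    use fastcdc_alt::StreamCDC;

    let source = File::open("test/fixtures/SekienAkashita.jpg").unwrap();
    // Assumed constructor: the source plus minimum, average and maximum chunk sizes,
    // mirroring FastCDC::new.
    let chunker = StreamCDC::new(source, 4096, 16384, 65535).unwrap();

    for (data, chunk) in chunker {
        println!("offset={} size={} bytes={}", chunk.offset, chunk.get_length(), data.len());
    }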