Module fastcdc::v2020

This module implements the canonical FastCDC algorithm as described in the paper by Wen Xia, et al., in 2020.

The algorithm incorporates a simplified hash judgement using the fast Gear hash, sub-minimum chunk cut-point skipping, normalized chunking to produce chunks of a more consistent length, and “rolling two bytes each time”. According to the authors, this should be 30-40% faster than the 2016 version while producing the same cut points. Benchmarks on several large files on an Apple M1 show about a 20% improvement, but results may vary depending on CPU architecture, file size, chunk size, etc.
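The ideas named above can be illustrated with a minimal, standard-library-only sketch. It is not this crate's code: the names `gear_table`, `next_cut`, and `chunk_lengths` are hypothetical, the Gear table is generated with SplitMix64 rather than taken from a fixed precomputed table, the masks simply use the top bits of the hash, and the "rolling two bytes each time" optimization is left out. It assumes `avg` is a power of two of at least 4 and that `min <= avg <= max`.

```rust
/// Build a 256-entry Gear table with SplitMix64 (illustrative values only).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

/// Return the length of the next chunk starting at the front of `data`.
fn next_cut(data: &[u8], gear: &[u64; 256], min: usize, avg: usize, max: usize) -> usize {
    if data.len() <= min {
        return data.len(); // too short to cut: the rest is one chunk
    }
    let end = data.len().min(max);
    let center = avg.clamp(min, end);
    let bits = avg.trailing_zeros();
    // Normalized chunking: a stricter mask before the average size and a
    // looser mask after it pull chunk lengths toward `avg`.
    let mask_s = u64::MAX << (64 - (bits + 1));
    let mask_l = u64::MAX << (64 - (bits - 1));
    let mut hash: u64 = 0;
    // Sub-minimum skipping: bytes before `min` are never hashed or tested.
    for i in min..end {
        // Gear hash: one shift and one table lookup per byte.
        hash = (hash << 1).wrapping_add(gear[data[i] as usize]);
        let mask = if i < center { mask_s } else { mask_l };
        if hash & mask == 0 {
            return i + 1;
        }
    }
    end // no cut point found: force a cut at the maximum size
}

/// Split `data` into chunk lengths by calling `next_cut` repeatedly.
fn chunk_lengths(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<usize> {
    let gear = gear_table();
    let mut lengths = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let len = next_cut(&data[start..], &gear, min, avg, max);
        lengths.push(len);
        start += len;
    }
    lengths
}
```

Calling `chunk_lengths(&data, 256, 1024, 4096)` yields lengths that sum to `data.len()`, none of which exceeds the maximum; only the final chunk may be shorter than the minimum.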

There are two ways to use the FastCDC struct defined in this module. One is to invoke cut() directly while managing your own start and remaining values. The other is to use the struct as an Iterator that yields Chunk structs, each of which represents the offset and size of a chunk. Note that using both cut() and the Iterator on the same FastCDC instance will yield incorrect results.
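The difference between the two styles can be sketched with a self-contained stand-in. `ToyChunker` and `ToyChunk` are hypothetical types, not the crate's FastCDC and Chunk, and the cutting rule (cut after a newline or at `max_size`) is a placeholder for the real algorithm. The sketch assumes that `cut()` takes a start offset and a remaining byte count and returns the hash together with the absolute offset of the cut point, and that `max_size` is greater than zero.

```rust
/// Hypothetical stand-in for a chunker; not the crate's FastCDC type.
struct ToyChunker<'a> {
    source: &'a [u8],
    max_size: usize,
    // State used only by the Iterator implementation.
    processed: usize,
    remaining: usize,
}

#[derive(Debug, PartialEq)]
struct ToyChunk {
    hash: u64,
    offset: usize,
    length: usize,
}

impl<'a> ToyChunker<'a> {
    fn new(source: &'a [u8], max_size: usize) -> Self {
        ToyChunker { source, max_size, processed: 0, remaining: source.len() }
    }

    /// Stateless cut: returns (hash, absolute offset of the cut point).
    fn cut(&self, start: usize, remaining: usize) -> (u64, usize) {
        let end = start + remaining.min(self.max_size);
        let mut hash = 0u64;
        for (i, &byte) in self.source[start..end].iter().enumerate() {
            hash = (hash << 1).wrapping_add(byte as u64);
            if byte == b'\n' {
                return (hash, start + i + 1);
            }
        }
        (hash, end)
    }
}

impl<'a> Iterator for ToyChunker<'a> {
    type Item = ToyChunk;

    fn next(&mut self) -> Option<ToyChunk> {
        if self.remaining == 0 {
            return None;
        }
        let (hash, cut_point) = self.cut(self.processed, self.remaining);
        let chunk = ToyChunk { hash, offset: self.processed, length: cut_point - self.processed };
        // The iterator advances its own position; a caller tracking a separate
        // `start` and `remaining` would not see this update.
        self.processed = cut_point;
        self.remaining -= chunk.length;
        Some(chunk)
    }
}

/// Style 1: call cut() and manage `start` and `remaining` by hand.
fn chunks_via_cut(source: &[u8], max_size: usize) -> Vec<ToyChunk> {
    let chunker = ToyChunker::new(source, max_size);
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut remaining = source.len();
    while remaining > 0 {
        let (hash, cut_point) = chunker.cut(start, remaining);
        let length = cut_point - start;
        chunks.push(ToyChunk { hash, offset: start, length });
        start = cut_point;
        remaining -= length;
    }
    chunks
}

/// Style 2: let the Iterator track the position internally.
fn chunks_via_iterator(source: &[u8], max_size: usize) -> Vec<ToyChunk> {
    ToyChunker::new(source, max_size).collect()
}
```

Both functions produce the same chunks for the same input. The position lives in the caller's variables in the first style and inside the struct in the second, which is why a single instance should be driven by only one of them.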

Note that the cut() function returns the 64-bit hash of the chunk, which may be useful in scenarios involving chunk size prediction using historical data, such as in RapidCDC or SuperCDC. This hash value is also given in the hash field of the Chunk struct. While this value has rather low entropy, it is computationally cost-free and can be put to some use with additional record keeping.

The StreamCDC implementation is similar to FastCDC, except that it reads data from a source implementing Read into an internal buffer of max_size bytes and produces ChunkData values from its Iterator.
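The buffering pattern described here can be sketched with the standard library alone. `ToyStream`, `ToyChunkData`, and `toy_cut` are hypothetical names, not the crate's StreamCDC or ChunkData, and the cutting rule (cut after a newline, otherwise take the whole buffer) is a placeholder for the real algorithm. The sketch only shows how a buffer of at most `max_size` bytes is topped up from a reader and how owned chunk data is handed out.

```rust
use std::io::{self, Read};

/// Hypothetical owned chunk: its offset in the stream and a copy of its bytes.
struct ToyChunkData {
    offset: u64,
    data: Vec<u8>,
}

/// Hypothetical streaming chunker; not the crate's StreamCDC type.
struct ToyStream<R: Read> {
    source: R,
    buffer: Vec<u8>, // bytes read but not yet emitted, at most `max_size`
    max_size: usize,
    offset: u64,
}

impl<R: Read> ToyStream<R> {
    fn new(source: R, max_size: usize) -> Self {
        ToyStream { source, buffer: Vec::with_capacity(max_size), max_size, offset: 0 }
    }

    /// Top up the buffer to `max_size` bytes, or until the reader is exhausted.
    fn fill(&mut self) -> io::Result<()> {
        let want = (self.max_size - self.buffer.len()) as u64;
        self.source.by_ref().take(want).read_to_end(&mut self.buffer)?;
        Ok(())
    }
}

/// Placeholder cutting rule: cut after the first newline, else take everything.
fn toy_cut(buffer: &[u8]) -> usize {
    buffer
        .iter()
        .position(|&byte| byte == b'\n')
        .map_or(buffer.len(), |index| index + 1)
}

impl<R: Read> Iterator for ToyStream<R> {
    type Item = io::Result<ToyChunkData>;

    fn next(&mut self) -> Option<Self::Item> {
        if let Err(error) = self.fill() {
            return Some(Err(error));
        }
        if self.buffer.is_empty() {
            return None; // reader exhausted and nothing left to emit
        }
        let length = toy_cut(&self.buffer);
        // Move the chunk's bytes out of the buffer; the rest stays for next time.
        let data: Vec<u8> = self.buffer.drain(..length).collect();
        let chunk = ToyChunkData { offset: self.offset, data };
        self.offset += length as u64;
        Some(Ok(chunk))
    }
}
```

Because the buffer never holds more than `max_size` bytes, memory use stays bounded regardless of the input size, at the cost of copying each chunk into its own `Vec<u8>`. Read errors surface as `Err` items from the iterator rather than as panics.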

Structs

Chunk
Represents a chunk returned from the FastCDC iterator.
ChunkData
Represents a chunk returned from the StreamCDC iterator.
FastCDC
The FastCDC chunker implementation from 2020.
StreamCDC
The FastCDC chunker implementation from 2020 with streaming support.

Enums

Error
The error type returned from the StreamCDC iterator.
Normalization
The level for the normalized chunking used by FastCDC.

Constants

AVERAGE_MAX
Largest acceptable value for the average chunk size.
AVERAGE_MIN
Smallest acceptable value for the average chunk size.
MAXIMUM_MAX
Largest acceptable value for the maximum chunk size.
MAXIMUM_MIN
Smallest acceptable value for the maximum chunk size.
MINIMUM_MAX
Largest acceptable value for the minimum chunk size.
MINIMUM_MIN
Smallest acceptable value for the minimum chunk size.