hash-roll 0.3.0

Rolling hashes & Content Defined Chunking (cdc)
Documentation

# Algorithms

 - Gear
 - FastCDC
 - Zpaq
 - AE
 - RollSum
 - BuzHash
 - LMC (chunking)
 - RAM Chunking (Rapid  Asymmetric  Maximum)
   - doi:10.1016/j.future.2017.02.013 
 - MII (minimal incrimental interval)
   - doi:10.1109/access.2019.2926195 
 - [TTTD]https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1041&context=etd_projects
 - [FBC]doi:10.1109/mascots.2010.37

# Algorithm Points of Comparison

 - Ability to constrain block size
   - distribution
   - tuneability of distribution
 - Speed
   - on different distributions
 - Common chunk discovery
   - on different distributions
 - Common chuck discovery after a byte shift
   - on different distributions
 - Common chuck discovery after edit
   - on different data distributions
   - under different edit kinds

# Impl Features

 - Incrimental input: rather than require a single `&[u8]` up front, allow
   providing a number of `&[u8]`s over the life of the splitter/hasher.

 - Slice input vs byte-at-a-time: By allowing algorithms to take in larger
   slices of data at a time, it enables them to potentially impliment
   optimizations to speed up computation.

# Implimentations

 - [cdc]https://lib.rs/crates/cdc
   - latest release: 2017-09-09
     - inactive development (as of 2020-06-21)
   - algorithm(s): "Rabin64" (polynomial based, 64-bit)
   - incrimental input: no
     - no documentation indicates incrimental input is possible
     - while one could use a special impl of `Iterator<Item=u8>` that can be
       extended, this would only work if the `SeperatorIter` or `ChunkIter` had
       not emitted a final incomplete chunk/seperator.
   - includes `RollingHash64` trait
   - structure includes mutable context, no non-mutable representation
   - input format(s): `Iterator<Item=u8>`, `u8`
     - may limit performance capability
   - input is fully buffered by cdc structures
   - provides both rolling hash and content splitting features
   - has _explicit_ representation for "prefilling" of the rolling hash.
   - includes multiple iterator adapters
     - splits the concept of a "seperator" (index + hash) vs a "chunk" (index +
       hash + size).
     - iterator adaptors don't generalize over rolling hashes, they are
       hard-coded to the `Rabin64` impl
   - documentation is lacking (almost universally missing)
 - [fastcdc]https://lib.rs/crates/fastcdc
   - latest release: 2020-03-19, v1.0.3
     - active development (as of 2020-06-21)
   - algorithm(s): FastCDC
   - incrimental input: no
   - api:
     - input: one `&[u8]`
     - output: `Iterator<Item=Chunk> where Chunk: (offset: usize, size:
       usize)`. Returns the remaining chunk as the last item even if not
       considered a complete chunk.
   - only struct mixes mutable and immutable data, no configuration representation
   - "chunks" are an offset and a size
     - iow: no rolling hash support
   - single struct, no traits
   - provides a fixed table for fastcdc (generated via a reproducable mechanism initially)
 - [quickcdc]https://lib.rs/crates/quickcdc
   - latest release: 2018-12-17 v1.0.0 (no other releases)
     - inactive development (as of 2020-06-21)
   - algorithm(s): AE (with modifications/extensions)
   - incrimental input: no
   - api:
     - input: one `&[u8]`
     - output: `Iterator<Item=&[u8]>`
   - no struct representation of configuration (only mixes mutable and immutable)
   - api: iterator over slices
   - single struct, no traits
   - includes improper use of unsafe in a non-public function (passes pointers
     into a function that dereferences them but the function is not marked
     unsafe).
 - [gearhash]https://lib.rs/crates/gearhash
   - latest release: 2020-04-12 v0.1.3
     - active development (as of 2020-06-21)
   - algorithm(s): gear
   - incrimental input: yes
   - provides simd & scalar impls
   - includes a static table for gearhash
   - api: call `next_match()` repeatedly with new slices. Returns a
     `Option<usize>` indicating where a split point is (if any) in the slice
     passed to `next_match()`.
   - `Hasher` struct provides both content splitting and rolling hash features.
     - in-place splitting
     - lacks helpers present in `cdchunking`
   - single struct, no traits
   - no struct representation of configuration (only mixes mutable and immutable data)
 - [cdchunking]https://lib.rs/crates/cdchunking
   - latest release: 2019-11-02 v1.0.0
     - inactive development (as of 2020-06-21)
   - algorithm(s): Zpaq
   - provides a chunker-impl trait
   - api: call `next_boundary()` repeatedly with new slices. Returns a
     `Option<usize>` indicating what a split point is (if any) in the slice.
     - must explicitly call a `reset()` after a match to reset internal state
       for subsequent matches.
   - provides a `Chunker` which takes a `ChunkerImpl` and provides a number of ease-of-use apis:
     - from a `Read` into a `Iterator<Item=Result<Vec<u8>>>`
     - from a `Read` into a `Result<Vec<Vec<u8>>>`
     - from a `Read` into a series of one of `Data(&[u8])` or `End`, where the
       `Data(&[u8])` are references to an internal buffer and `End` indicate
       the end of a chunk.
     - from a `Read` to an iterator of (start, len, end) (ie: no data returned)
     - from a `&[u8]` to an `Iterator<Item=[u8]>`
 - [rollsum]https://lib.rs/crates/rollsum
   - latest release: 2016-05-30 v0.2.1
     - inactive development (as of 2020-06-21)
   - algorithm(s): rollsum (based on bupsplit, based on rsync chunking)
   - low level trait has byte-by-byte and slice based interfaces
   - exposes conditionality of chunk edge (ie: like a rolling-sum) in trait,
     but provides a helper on the specific struct that uses it's defaults.
   - requires explicit state resets after finding a chunk edge to find the next
     chunk edge (doesn't reset internal state)
   - api: call `find_chunk_edge()` with different slices until Some(usize) is
     returned. the `usize` here is the offset after the end of the chunk (ie:
     start of the next chunk).
 - [rededup-cdc]https://lib.rs/crates/rdedup-cdc
   - `rollsum` fork
 - [bitar]https://lib.rs/crates/bitar
   - latest release: 2020-06-09 v0.7.0
     - active development (as of 2020-06-21)
   - algorithms(s): BuzHash, RollSum
   - uses enum to abstract over algorithms (`Config` and `Chunker`)
   - includes seperate immutable "configuration object" concept (`Config`)
   - supports/requires use of `tokio::AsyncRead` as input
   - api: provide a `AsyncRead` when constructing the `Chunker`. Use the
     `futures::Stream<Item=Result<(u64, Bytes)>>` it returns
   - low-level trait for each hash is byte-at-a-time
   - many other items included in the library (designed to support the cmdline tool `bita`)
 - [zvault]https://github.com/dswd/zvault
   - algorithm(s): AE, fastcdc, rabin, fixed (non content defined)
   - low level trait requires a Read & a Write instance
   - provides run-time generic over creation & extraction of some details (`Chunker`)
   - Instantiation for each provides a seed and average size
   - inactive development (last change 2018-03-08 (as of 2020-05-10))
   - includes many non-chunking items