bao

bao (pronounced "bough") is a general purpose tree hash for files.
Tree hashes have two big benefits over regular serial hashes:
- Parallelism. A regular hash can only occupy one CPU core, but a tree hash with enough input can spread the work over any number of cores.
- Streaming verification. The only way to verify that a regular file matches a hash is to download the entire file. A tree hash makes it efficient to stream verified bytes out of a file, or to consume arbitrary pieces of a file BitTorrent-style.
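
Both benefits fall out of hashing fixed-size chunks independently and then combining the chunk hashes. Here's a toy sketch in Python of that idea; the names, the chunk size, and the flat two-level tree are all illustrative, and this is not bao's actual tree mode:

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative; not necessarily bao's chunk size

def toy_tree_hash(data: bytes) -> str:
    # Hash each fixed-size chunk independently. Because the chunks
    # don't depend on each other, this loop can be spread over any
    # number of CPU cores.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    leaves = [hashlib.blake2b(c).digest() for c in chunks]
    # Combine the leaf hashes into a single root hash. A verifier who
    # knows the root can check any individual chunk without reading
    # the whole input, which is what enables streaming verification.
    return hashlib.blake2b(b"".join(leaves)).hexdigest()
```
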
bao hash is quite fast. The underlying hash function is BLAKE2b, with
an AVX2 implementation provided by
blake2b_simd. The input
gets memory mapped and then split among worker threads with
rayon. On the i5-8250U processor
in my laptop, it hashes a 1 gigabyte file in 0.25 seconds, or 4 GB/s
including startup and IO. By comparison, the fastest Coreutils hash,
sha1sum, takes 1.32 seconds. Large input in-memory benchmarks on an
AWS m5.24xlarge instance (96 cores) measure 60 GB/s throughput. (That's
about 61% of perfect 96-core scaling of single-core BLAKE2b throughput,
with the shortfall mostly due to CPU frequency scaling.)

When input is piped and memory mapping isn't possible, bao hash falls
back to a single-threaded streaming implementation.
bao encode copies its input and produces an encoded file with a small
header and subtree hashes interspersed throughout, currently 1.5% larger
than the original. bao hash --encoded can quickly extract the root
hash from the encoded file, the same result as running bao hash on the
original content. Given that hash, bao decode will stream verified
content bytes from the encoded file, with an optional --start
offset. decode can read from a pipe or a socket, unless --start is
used, in which case it needs to be able to seek.
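
To see why a decoder can emit verified bytes before reading the whole file, consider this toy format, where the encoding is a table of chunk hashes followed by the raw chunks. This is not bao's actual encoding layout; the names and the flat hash table are hypothetical, but the check-before-emit pattern is the same:

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative

def toy_encode(data: bytes):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    hashes = [hashlib.blake2b(c).digest() for c in chunks]
    root = hashlib.blake2b(b"".join(hashes)).hexdigest()
    # "Encoded" file: all chunk hashes up front, then the raw content.
    return root, b"".join(hashes) + data

def toy_decode(root: str, encoded: bytes, n_chunks: int):
    table = encoded[:64 * n_chunks]
    hashes = [table[i:i + 64] for i in range(0, len(table), 64)]
    # Verify the hash table against the root before trusting it.
    if hashlib.blake2b(b"".join(hashes)).hexdigest() != root:
        raise ValueError("bad root")
    body = encoded[64 * n_chunks:]
    for i, h in enumerate(hashes):
        chunk = body[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        # Verify each chunk before yielding it, so the consumer never
        # sees bytes that don't match the root hash.
        if hashlib.blake2b(chunk).digest() != h:
            raise ValueError("corrupt chunk")
        yield chunk
```

The decoder reads strictly front to back, which is why decoding without --start works on pipes and sockets.
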
bao slice takes a start offset and a byte count and extracts the parts
of an encoded file needed to read just those bytes. bao decode-slice
takes the same offset and count plus the same hash that decode
uses, reads in the output of slice, and outputs the specified range
of content bytes. While slice and decode --start=... require
seeking, and so generally require a full encoded file on disk,
decode-slice doesn't need to seek and can stream from a pipe or a
socket like a regular decode. Note that bao hash --encoded can hash
an extracted slice just like a full encoded file.
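
The arithmetic behind slicing is rounding the requested byte range outward to chunk boundaries. A sketch of that calculation (chunk size illustrative; a real bao slice also includes the tree nodes that prove those chunks against the root hash):

```python
CHUNK_SIZE = 4096  # illustrative

def chunk_range(start: int, count: int):
    # The slice must contain every chunk that overlaps the byte range
    # [start, start + count), so round start down and the end up to
    # chunk boundaries.
    if count <= 0:
        return (0, 0)
    first = start // CHUNK_SIZE
    last = (start + count - 1) // CHUNK_SIZE
    return (first, last + 1)  # half-open range of chunk indexes
```
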
The encode, decode, slice, and hash commands all support an
--outboard flag. This mode reads or writes tree data in a separate
file apart from the input bytes, so that you can keep the unmodified
input file without taking up twice as much disk space. The total size of
an input file plus an outboard tree file is the same as the size of an
encoded file in the default combined mode. Note that if you slice the
entire input (using the slice parameters start=0 and len=size), the
result is exactly the same as an entire combined-mode encoding, so
slice can be an efficient way of converting from outboard to combined
without re-hashing or writing to disk.

You can build the bao binary from the bao_bin sub-crate, like this:
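
A standard Cargo workflow for a sub-crate would look something like the following; the repository URL and output path are assumptions, so adjust them to your checkout:

```shell
# Assumed repository URL; adjust as needed.
git clone https://github.com/oconnor663/bao
cd bao/bao_bin
cargo build --release
# In a Cargo workspace the binary may land in the workspace-level
# target directory rather than bao_bin's own.
./target/release/bao --help
```
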
test/bao.py includes a Python implementation designed
to be as short and readable as possible.

There is no_std support if you set default-features = false in your
Cargo.toml. Most of the standalone functions that don't obviously
depend on std are available. For example, encode::encode is
available with a single threaded implementation, but
encode::encode_to_vec isn't available. Of the streaming
implementations, only hash::Writer is available, because the encoding
and decoding implementations rely more on the std::io traits. If there
are any callers that want to do streaming encoding or decoding under
no_std, please let me know, and we can figure out which
libcore-compatible IO interfaces it makes sense to use.
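
For reference, disabling default features in Cargo.toml looks like this (the version is a placeholder; pin whichever release you actually use):

```toml
[dependencies]
# "*" is a placeholder, not a recommendation; pin a real version.
bao = { version = "*", default-features = false }
```
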