Bao

Spec — Rust Crate — Rust Docs
Caution! Not yet suitable for production use. The output of Bao isn't stable. There might be more changes before 1.0.
Bao (rhymes with bough 🌳) is a general-purpose cryptographic tree hash. Here's the full specification and a recorded talk about it. Unlike a serial hash function, a tree hash allows the implementation to work on different parts of the input in parallel. On modern hardware with multiple cores and SIMD instructions, that makes a dramatic difference in performance.
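The core idea can be sketched in a few lines of Python. This is a toy illustration of tree hashing in general, not Bao's actual construction: the real chunk size, keying, and leaf/parent/root finalization are all fixed by the spec, and plain `blake2s` here stands in for the spec's parameterized BLAKE2s instances.

```python
from concurrent.futures import ThreadPoolExecutor
from hashlib import blake2s

CHUNK_SIZE = 4096  # illustrative only; the spec fixes the real chunk size

def toy_tree_hash(data: bytes) -> bytes:
    # Hash every chunk independently. This is the step that a tree hash
    # can spread across cores and SIMD lanes; a serial hash cannot.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    with ThreadPoolExecutor() as pool:
        level = list(pool.map(lambda c: blake2s(c).digest(), chunks))
    # Combine pairs of child hashes into parent hashes up to a single root.
    # (A real design also domain-separates chunks, parents, and the root
    # node to prevent collisions between inputs of different shapes.)
    while len(level) > 1:
        level = [
            blake2s(level[i] + level[i + 1]).digest() if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Because each chunk's hash depends only on that chunk, the first pass parallelizes freely; the combining pass touches only 32-byte hashes and is cheap by comparison.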

Bao also performs well on short messages and 32-bit systems. It's based on the BLAKE2s hash function, which was designed for those use cases.
Encoded files
Apart from high performance, a tree hash also lets you verify part of a file without re-hashing the entire thing. Bao defines an encoding format, which stores all the nodes of a hash tree interleaved with their input. Clients can stream this encoding, or do random seeks into it, while verifying that every byte they read matches the root hash.
Use case: A secure messaging app might support attachment files by including the hash of an attachment in a message. With a serial hash, the recipient would need to download an entire attachment to verify it, but that's impractical for e.g. large video files. With a tree hash like Bao, the recipient can stream a video, while still verifying each byte as it comes in. (This scenario was in fact the original motivation for the Bao project.)
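To make the interleaving concrete, here is a toy version of the idea in Python. It is not the real format (the spec fixes the chunk size, the left/right split rule, node finalization, and a length header), but it shows how a decoder can check every parent node and every chunk against the root hash as it reads:

```python
from hashlib import blake2s
import io

CHUNK_SIZE = 4096  # illustrative; the spec fixes the real parameters

def left_len(n: int) -> int:
    # Toy split rule: the left subtree gets the largest power-of-two
    # number of bytes (in whole chunks) strictly less than the total.
    p = CHUNK_SIZE
    while p * 2 < n:
        p *= 2
    return p

def encode(data: bytes):
    """Return (hash, encoding). Each parent node (the concatenated hashes
    of its two children) is written before the content it commits to."""
    if len(data) <= CHUNK_SIZE:
        return blake2s(data).digest(), data
    llen = left_len(len(data))
    left_hash, left_enc = encode(data[:llen])
    right_hash, right_enc = encode(data[llen:])
    parent = left_hash + right_hash
    return blake2s(parent).digest(), parent + left_enc + right_enc

def decode(stream, expected_hash: bytes, content_len: int) -> bytes:
    """Stream-decode, verifying every node before trusting its bytes.
    (The real format stores the length in a header rather than passing
    it separately like this.)"""
    if content_len <= CHUNK_SIZE:
        chunk = stream.read(content_len)
        assert blake2s(chunk).digest() == expected_hash, "hash mismatch"
        return chunk
    parent = stream.read(64)
    assert blake2s(parent).digest() == expected_hash, "hash mismatch"
    llen = left_len(content_len)
    return (decode(stream, parent[:32], llen) +
            decode(stream, parent[32:], content_len - llen))
```

Seeking falls out of the same structure: to read a byte range, the decoder follows only the tree path covering that range instead of recursing into both children.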
# Create an input file that's a megabyte of random data.
> head -c 1000000 /dev/urandom > f
# Convert it into a Bao encoded file.
> bao encode f f.bao
# Compare the size of the two files. The encoding overhead is small.
> stat -c "%n %s" f f.bao
# Compute the hash of the original file.
> hash=$(bao hash f)
# Stream decoded bytes from the encoded file, using the hash above.
> bao decode "$hash" < f.bao > f2
> cmp f f2
# Observe that using the wrong hash to decode results in an error. This
# is also what will happen if we use the right hash but corrupt some
# bytes in the encoded file.
> bad_hash="0000000000000000000000000000000000000000000000000000000000000000"
> bao decode "$bad_hash" < f.bao
Encoded slices
Encoded files support random seeking, but seeking might not be available or efficient over a network. (Note that one seek in the content usually requires several seeks in the encoding, as the decoder traverses the hash tree level-by-level.) In these situations, rather than e.g. hacking a seek interface into your HTTP client, you can instead request an encoded slice. A slice contains some part of the original file and the parts of the hash tree needed to verify it. Creating a slice requires seeking over the full encoding, but the recipient can then stream the slice without needing to seek. Decoding a slice uses the same root hash as regular decoding, so it doesn't require any preparation in advance from the sender or the recipient.
Use case: A BitTorrent-like application could fetch different slices of a file from different peers, without needing to define the slices ahead of time. Or a distributed file storage application could request random slices of an archived file from its storage providers, to prove that they're honestly storing the file, without needing to prepare or store challenges for the future.
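The principle behind slices can be sketched as a Merkle-style inclusion proof: to verify one chunk, the recipient needs only the chunk itself plus one sibling hash per tree level. (This is a simplification; Bao's real slice format follows the encoding's pre-order layout and covers arbitrary byte ranges, as defined in the spec.)

```python
from hashlib import blake2s

def _combine(level):
    # One level of pairwise hashing; an unpaired last node is promoted as-is.
    return [
        blake2s(level[i] + level[i + 1]).digest() if i + 1 < len(level) else level[i]
        for i in range(0, len(level), 2)
    ]

def root_hash(chunks):
    level = [blake2s(c).digest() for c in chunks]
    while len(level) > 1:
        level = _combine(level)
    return level[0]

def prove(chunks, index):
    """Collect the sibling hash at each level above chunks[index]."""
    proof, level = [], [blake2s(c).digest() for c in chunks]
    while len(level) > 1:
        sibling = index ^ 1
        proof.append(level[sibling] if sibling < len(level) else None)
        level = _combine(level)
        index //= 2
    return proof

def verify(chunk, index, proof, root):
    """Recompute the path from the chunk to the root and compare."""
    h = blake2s(chunk).digest()
    for sibling in proof:
        if sibling is None:
            pass  # this node was promoted unpaired at this level
        elif index % 2 == 0:
            h = blake2s(h + sibling).digest()
        else:
            h = blake2s(sibling + h).digest()
        index //= 2
    return h == root
```

The proof grows logarithmically with the number of chunks, which is why slice overhead stays small even for large files.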
# Using the encoded file from above, extract a 100 KB slice from
# somewhere in the middle. We'll use start=500000 (500 KB) and
# count=100000 (100 KB).
> bao slice 500000 100000 f.bao f.slice
# Look at the size of the slice. It contains the 100 KB of content plus
# some overhead. Again, the overhead is small.
> stat -c "%n %s" f.slice
# Using the same parameters we used to create the slice, plus the same
# hash we got above from the full encoding, decode the slice.
> bao decode-slice "$hash" 500000 100000 f.slice f.slice.out
# Confirm that the decoded output matches the corresponding section from
# the input file. (Note that `tail` numbers bytes starting with 1.)
> tail --bytes=+500001 f | head -c 100000 | cmp - f.slice.out
# Now try decoding the slice with the wrong hash. Again, this will fail,
# as it would if we corrupted some bytes in the slice.
> bao decode-slice "$bad_hash" 500000 100000 f.slice
Outboard mode
By default, all of the operations above work with a "combined" encoded
file, that is, one that contains both the content bytes and the tree
hash bytes interleaved. However, sometimes you want to keep them
separate, for example to avoid duplicating a very large input file. In
these cases, you can use the "outboard" encoding format, via the
--outboard flag:
# Re-encode the input file from above in the outboard mode.
> bao encode f --outboard f.obao
# Compare the size of all these files. The size of the outboard file is
# equal to the overhead of the original combined file.
> stat -c "%n %s" f f.bao f.obao
# Decode the whole file in outboard mode. Note that both the original
# input file and the outboard encoding are passed in as arguments.
> bao decode "$hash" f --outboard f.obao | cmp f -
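For a back-of-the-envelope size estimate, the arithmetic below assumes 4096-byte chunks, 64-byte parent nodes (two 32-byte child hashes), and an 8-byte length header; those figures are my reading of the spec, so check it for the authoritative values:

```python
import math

CHUNK_SIZE = 4096   # assumed chunk size; see the spec
PARENT_SIZE = 64    # assumed: two 32-byte child hashes per parent node
HEADER_SIZE = 8     # assumed content-length header

def tree_overhead(content_len: int) -> int:
    # A binary tree with one leaf per chunk has (leaves - 1) parent nodes.
    leaves = max(1, math.ceil(content_len / CHUNK_SIZE))
    return HEADER_SIZE + (leaves - 1) * PARENT_SIZE

def combined_size(content_len: int) -> int:
    # A combined encoding stores the content plus the tree; an outboard
    # file stores the tree alone, so its size equals this overhead.
    return content_len + tree_overhead(content_len)
```

Under these assumptions a 1 MB input has 245 chunks, so the outboard file is about 8 + 244 * 64 = 15,624 bytes, roughly 1.6% of the content.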
Installing and building from source
The bao command line utility is published on
crates.io as the
bao_bin crate. To install it, add
~/.cargo/bin to your PATH and then run:
> cargo install bao_bin
To build the binary directly from this repo:
> git clone https://github.com/oconnor663/bao
> cd bao/bao_bin
> cargo install --path .
tests/bao.py is a fully functional second
implementation in Python, designed to be as short and readable as
possible. It's a good starting point for understanding the algorithms
involved, before diving into the Rust code.
The bao library crate includes no_std support if you set
default-features = false in your Cargo.toml. Currently only the
hash module is available, with Rayon-based multithreading disabled.
The encode and decode modules currently depend on the
std::io::{Read, Write, Seek} traits. Those traits are all they need
from std, and they do not allocate, so they could be made
no_std-compatible in the future with some compatibility layer.

