Bao
Spec — Rust Crate — Rust Docs
Caution: Bao is intended to be a cryptographic hash function, but it hasn't yet been reviewed. The output may change prior to the 1.0 release.
Bao (rhymes with bough 🌳) is a general purpose tree hash for files. Here's the full specification. What makes a tree hash different from a regular hash? Depending on how many cores you've got in your machine, the first thing you might notice is that it's ten times faster than SHA-2:
Why is bao hash so fast? It's mostly parallelism, multiple threads working on different subtrees at the same time. The demo above is from the 4-core i5-8250U processor in my laptop, but Bao can scale much higher. In-memory benchmarks on a 48-core AWS m5.24xlarge instance hit 91 GB/s of throughput. Bao is also based on BLAKE2b, which was designed to outperform SHA-1, and it uses the fastest SIMD implementation available.
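To make the "different subtrees at the same time" idea concrete, here is a minimal Python sketch of a toy tree hash. It is not the Bao algorithm: the 4096-byte chunk size, the one-byte leaf/parent prefixes, and the way an odd node is promoted to the next level are all assumptions made for illustration, and the function names are hypothetical. The specification is the authority on the real layout. The point is only that leaf hashes don't depend on each other, so they can be handed to a pool of workers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # assumed for this toy; see the spec for Bao's real parameters


def hash_chunk(chunk: bytes) -> bytes:
    # Leaves get a 0x00 prefix so a leaf can't be mistaken for a parent node.
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def hash_parent(left: bytes, right: bytes) -> bytes:
    # Parents get a 0x01 prefix and hash the two child hashes together.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def toy_tree_hash(data: bytes, workers: int = 1) -> bytes:
    # Split the input into fixed-size chunks (an empty input is one empty chunk).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    # Each chunk hash is independent of the others, so this step can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(hash_chunk, chunks))
    # Combine neighbours pairwise until a single root hash is left.
    while len(level) > 1:
        next_level = [hash_parent(level[i], level[i + 1])
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            next_level.append(level[-1])  # odd node out moves up unchanged (toy rule)
        level = next_level
    return level[0]
```

The root is the same no matter how many workers compute the leaves, which is what lets a tree hash spread work across cores without changing its output. Whether threads give a real speedup in Python depends on the interpreter and the chunk size; the Rust implementation is where the throughput numbers above come from.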
Encoded files
Apart from parallelism, tree hashes make it possible to verify a file piece-by-piece rather than all-at-once. This is done by storing both the input and the entire hash tree together in an encoded file.
Use case: A cryptographic messaging app might want to add support for attachments, like large video files. If the message metadata includes the Bao hash of its attachment, the client can stream an attached video without compromising its immutability. (This problem was in fact the original inspiration for the Bao project.)
# Create an input file that's a megabyte of random data.
> head
# Convert it into a Bao encoded file.
> bao
# Compare the size of the two files. The encoding overhead is small.
> stat |
# Note that the `bao hash` of the input file is the same as the
# `bao hash --encoded` of the encoded file, but the latter is faster.
> bao
> bao
> hash=
# Stream decoded bytes from the encoded file, using the hash above.
> cmp
# Observe that using the wrong hash to decode results in an error. This
# is also what will happen if we use the right hash but corrupt some
# bytes in the encoded file.
> bad_hash=
> cmp
Encoded slices
For situations where you need to serve some verifiable bytes from the middle of a file, without forcing the recipient to stream the whole thing from the front, you can extract an encoded slice.
Use case: A BitTorrent-like application could fetch different slices of a file from different peers. Or, a distributed file storage application could request random slices of an archived file from its storage providers, to prove that they're honestly storing the file.
# Using the encoded file from above, extract 100 KB from somewhere in
# the middle. We'll use start=500000 (500 KB) and count=100000 (100 KB).
> bao
# Look at the size of the slice. It contains the 100 KB of content plus
# some overhead. Again, the overhead is small.
> stat
# Using the same parameters we used to create the slice, plus the same
# hash we got above from the full encoding, decode the slice.
> bao
# Confirm that the decoded output matches the corresponding section from
# the input file. (Note that `tail` numbers bytes starting with 1.)
> tail --bytes=+500001 |
> cmp
# Now try decoding the slice with the wrong hash. Again, this will fail,
# as it would if we corrupted some bytes in the slice.
> bao
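Why can a slice from the middle be verified against the hash of the whole file? The following Python sketch shows the general Merkle-proof idea on a toy tree. It is not Bao's slice format: the chunk size, node prefixes, odd-node rule, and all function names here are assumptions for illustration. The toy also doesn't authenticate the chunk's position or the total length, which a real format has to handle.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed for this toy, not taken from the Bao spec


def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def build_levels(data: bytes) -> list:
    # levels[0] holds the leaf hashes; levels[-1] holds just the root hash.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    levels = [[leaf_hash(c) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [parent_hash(prev[i], prev[i + 1]) for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2:
            nxt.append(prev[-1])  # odd node out moves up unchanged (toy rule)
        levels.append(nxt)
    return levels


def make_proof(levels: list, index: int) -> list:
    # Collect the sibling hash at every level on the path from leaf to root.
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((sibling < index, level[sibling]))
        # If there is no sibling, the node was promoted and contributes nothing.
        index //= 2
    return proof


def verify_chunk(root: bytes, chunk: bytes, proof: list) -> bool:
    # Recompute the path to the root using only the chunk and its sibling hashes.
    h = leaf_hash(chunk)
    for sibling_is_left, sibling in proof:
        h = parent_hash(sibling, h) if sibling_is_left else parent_hash(h, sibling)
    return h == root


if __name__ == "__main__":
    data = bytes(range(256)) * 100
    levels = build_levels(data)
    root = levels[-1][0]
    chunk = data[2 * CHUNK_SIZE:3 * CHUNK_SIZE]
    print(verify_chunk(root, chunk, make_proof(levels, 2)))
```

The recipient only needs the root hash, the chunk itself, and one sibling hash per tree level, which is why the overhead of a slice stays small compared with the content it carries.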
Outboard mode
By default, all of the operations above work with a "combined" encoded file, that is, one that contains both the content bytes and the tree hash bytes interleaved. However, sometimes you want to keep them separate, like to avoid copying the contents of a very large input file. In these cases, you can use the "outboard" encoded format, via the --outboard flag:
# Re-encode the input file from above in the outboard mode.
> bao
# Compare the size of all these files. The size of the outboard file is
# equal to the overhead of the original combined file.
> stat |
# Decode the whole file in outboard mode. Note that both the original
# input file and the outboard encoding are passed in as arguments.
> cmp
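The size relationships mentioned in the comments above (small overhead, and an outboard file equal in size to that overhead) can be estimated with a little arithmetic. This Python sketch assumes an 8-byte length header, 4096-byte chunks, and 64-byte parent nodes (two 32-byte hashes). Those constants and the function names are assumptions for illustration; check them against the specification before relying on the numbers.

```python
HEADER_SIZE = 8     # assumed: encoded length header, in bytes
CHUNK_SIZE = 4096   # assumed: content bytes per leaf chunk
PARENT_SIZE = 64    # assumed: one parent node holds two 32-byte child hashes


def count_chunks(content_len: int) -> int:
    # Ceiling division; even an empty input counts as one (empty) chunk.
    return max(1, -(-content_len // CHUNK_SIZE))


def outboard_size(content_len: int) -> int:
    # A binary tree with n leaves has n - 1 parent nodes.
    return HEADER_SIZE + PARENT_SIZE * (count_chunks(content_len) - 1)


def combined_size(content_len: int) -> int:
    # The combined encoding is the content plus the same tree overhead.
    return content_len + outboard_size(content_len)


if __name__ == "__main__":
    n = 1_000_000  # a decimal megabyte, as in the example session
    print(outboard_size(n), combined_size(n))
```

Under these assumptions, a 1,000,000-byte input has 245 chunks and 244 parent nodes, so the tree adds 15,624 bytes, a bit under 1.6% of the content size.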
Installing and building from source
The bao command line utility is published on crates.io as the bao_bin crate. To install it, add ~/.cargo/bin to your PATH and then run:
To build the binary directly from this repo:
tests/bao.py is a fully functional second implementation in Python, designed to be as short and readable as possible. It's a good starting point for understanding the algorithms involved, before diving into the Rust code.
The bao library crate includes no_std support if you set default-features = false in your Cargo.toml. Most of the standalone functions that don't obviously depend on std are available. For example, bao::encode::encode is available with a single-threaded implementation, but bao::encode::encode_to_vec isn't available. Of the streaming implementations, only hash::Writer is available, because the encoding and decoding implementations rely more on the std::io::{Read, Write, Seek} interfaces. If there are any callers that want to do streaming encoding or decoding under no_std, please let me know, and we can figure out which libcore-compatible traits it makes sense to implement.