bunk/lib.rs
//! Fast and efficient human-readable data encoding!
//!
//! Bunk encodes binary data as pronounceable gibberish, somewhat resembling Latin. This is useful when
//! binary data such as an encryption key is shown to an end-user who might need to manually transfer it.
//!
//! Using the default [settings](Settings), a string of 32 bytes gets encoded as:
//! ```text
//! atemorni telphocom neideu gepypi forzamar oasal cevanal butthepo aujoate turviy menkais
//! ```
//!
//! Optionally, Bunk can [decorate](Settings::decorate) the encoded string with commas, periods, and sentence
//! casing to improve readability:
//! ```text
//! Atemorni telphocom. Neideu gepypi forzamar oasal cevanal butthepo aujoate turviy, menkais.
//! ```
//!
//!
//! # Overview
//!
//! - It is fast! On my machine, encoding and then decoding a random array of 32 bytes takes an average of
//!   ~0.8µs with the default settings --- allocations and all; no hidden fees.
//! - It is small! Bunk stores a table of only 256 syllables, each between 1-4 letters (average of 2.47), and
//!   some data structures needed for fast lookup.
//! - Checksums of variable length can be added to encoded messages to verify data integrity when decoding,
//!   which protects against typos.
//! - The [maximum word length](Settings::word_len) (in syllables) can be customized.
//!
//!
//! # How it compares to English dictionary encodings
//!
//! A popular scheme is to encode binary data as actual English words, which yields results that are more
//! readable and easier to remember. See [bip39](https://docs.rs/tiny-bip39/) as an example of this. However,
//! to be efficient in the amount of data a string of words can encode, a _massive_ table of (sometimes
//! quite long) words must be included --- [bip39](https://docs.rs/tiny-bip39/) uses 2048 words. In addition
//! to this, some kind of data structure for lookup is also needed, and will likely have to be constructed at
//! runtime. If this is no object for your application, use something like
//! [bip39](https://docs.rs/tiny-bip39/) instead!
//!
//! Bunk takes a different approach, requiring a table of only 256 1-4 letter syllables, each carrying one
//! byte of data. This allows Bunk to:
//! - Take up less memory overall.
//! - Store data structures needed for fast lookup in static memory instead of having to construct them at
//!   runtime.
//!
//!
//! # Serde
//!
//! Enable the `serde` feature and Bunk can be used to serialize/deserialize fields that implement
//! `AsRef<[u8]>` and `From<Vec<u8>>`:
//! ```text
//! #[derive(Serialize, Deserialize)]
//! struct Vault {
//!     #[serde(with = "bunk")]
//!     key: Vec<u8>,
//!     name: String,
//! }
//! ```
//!
//! Note that the [settings](Settings) used when encoding for serde are necessarily hard-coded:
//! ```no_run
//! # use bunk::*;
//! # let _ =
//! Settings {
//!     word_len: Some(3),
//!     checksum: Checksum::Disabled,
//!     decorate: false,
//! }
//! # ;
//! ```
//!
//!
//! # Examples
//!
//! Basic usage with default [settings](Settings):
//! ```
//! let encoded = bunk::encode(b"aftersun");
//! let decoded = bunk::decode(encoded)?;
//!
//! assert_eq!(decoded, b"aftersun");
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Disabled [checksum](Checksum):
//! ```
//! use bunk::{Checksum, Settings};
//!
//! let settings = Settings {
//!     checksum: Checksum::Disabled,
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings(b"it's such a beautiful day", settings);
//! let decoded = bunk::decode_with_settings(encoded, settings.checksum)?;
//!
//! assert_eq!(decoded, b"it's such a beautiful day");
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Custom [checksum length](Checksum):
//! ```
//! use bunk::{Checksum, Settings};
//!
//! let settings = Settings {
//!     checksum: Checksum::Length4,
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings([33, 14, 224, 134], settings);
//! let decoded = bunk::decode_with_settings(encoded, settings.checksum)?;
//!
//! assert_eq!(decoded, [33, 14, 224, 134]);
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Custom [word length limit](Settings::word_len):
//! ```
//! use bunk::Settings;
//!
//! let settings = Settings {
//!     word_len: Some(5),
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings([231, 6, 39, 34], settings);
//! let decoded = bunk::decode(encoded)?; // word_len doesn't affect the decoder
//!
//! assert_eq!(decoded, [231, 6, 39, 34]);
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//!
//! # How it works
//!
//! To explain the algorithm, we'll iteratively build upon it and solve issues as we go.
//!
//! The fundamental idea is to encode a byte as a syllable by using it to index into a table of 256 unique
//! syllables, the result of which is then appended to the encoded string --- as one would expect. The
//! decoder can then use a [trie](https://en.wikipedia.org/wiki/Trie) to find the index of the longest
//! syllable at the beginning of the string, which corresponds to the encoded byte.
//!
//! This by itself causes issues of parser ambiguity when one valid syllable is a prefix of another. Take as
//! a basic example the encoded string "ous". Is this the single syllable "ous", or the syllable "o" followed
//! by "us"? Barring some cumbersome machinery, there is no way for the decoder to know! The encoder
//! therefore has to detect when such an ambiguity is possible by checking if the first letter of the second
//! syllable is a valid continuation of the first syllable. If so, it inserts a word break between them.
//! (Technically, this is stricter than necessary for breaking the ambiguity but is easy to check and allows
//! the decoder to be written greedily.)
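//!
//! As a sketch of this rule (using a hypothetical three-syllable table, not Bunk's real one), greedy
//! longest-match decoding always reads "ous" as one syllable, so the encoder must emit "o us" with a
//! word break when it means "o" followed by "us":
//! ```
//! let table = ["o", "us", "ous"]; // toy table for illustration
//! // Greedy decoding: take the longest syllable at the start of the input.
//! let longest = |s: &str| {
//!     table.iter().filter(|syl| s.starts_with(*syl)).max_by_key(|syl| syl.len())
//! };
//! assert_eq!(longest("ous"), Some(&"ous")); // never decoded as "o" + "us"
//! ```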
//!
//! To support these two required operations --- finding the longest syllable at the start of a string, and
//! checking whether a letter is a valid continuation of a syllable --- Bunk uses a trie. Two issues then
//! present themselves:
//! - Tries are _slow_ to construct.
//! - There are (somehow) no efficient trie libraries for Rust that allow for these operations in their API.
//!
//! As a solution to both of these, a precomputed trie (as created by [crawdad](https://docs.rs/crawdad/)) is
//! stored in static memory, on top of which Bunk implements a basic traversal --- the only API needed for
//! the two operations. All in all, the trie API comes out to only about 60 lines of code --- much less than
//! having to add [crawdad](https://docs.rs/crawdad/) (or such) as a dependency.
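//!
//! To illustrate the idea (with a made-up edge-list layout, not crawdad's actual format), traversing a
//! precomputed trie needs nothing more than a walk over static edges, recording the last terminal node seen:
//! ```
//! // Toy static trie for the syllables "o" (index 0), "us" (1), and "ous" (2):
//! // edges are (node, letter, child); terminals map a node to a syllable index.
//! static EDGES: [(u8, u8, u8); 5] = [(0, b'o', 1), (0, b'u', 2), (1, b'u', 3), (2, b's', 4), (3, b's', 5)];
//! static TERMINALS: [(u8, u8); 3] = [(1, 0), (4, 1), (5, 2)];
//!
//! // Finds the longest syllable at the start of `s`, returning (syllable index, length in letters).
//! fn longest_prefix(s: &[u8]) -> Option<(u8, usize)> {
//!     let (mut node, mut best) = (0, None);
//!     for (i, &letter) in s.iter().enumerate() {
//!         match EDGES.iter().find(|&&(n, l, _)| n == node && l == letter) {
//!             Some(&(_, _, child)) => node = child,
//!             None => break,
//!         }
//!         if let Some(&(_, index)) = TERMINALS.iter().find(|&&(n, _)| n == node) {
//!             best = Some((index, i + 1));
//!         }
//!     }
//!     best
//! }
//!
//! assert_eq!(longest_prefix(b"ousted"), Some((2, 3))); // "ous", carrying byte 2
//! ```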
//!
//! So far, the algorithm we've described is a perfectly functional encoder. However, to be more
//! user-friendly, we'd ideally also like _all_ inputs to yield equally pronounceable text. Without any
//! further measures, inputs such as `[0, 0, 0, 0]` yield repeated syllables, in this case "uuu u". To avoid
//! this, Bunk artificially increases the _apparent_ entropy of encoded bytes by first XORing them with a
//! value dependent on their index. Since XOR undoes itself, the decoder can then do the exact same thing and
//! retrieve the original bytes. With this in place, `[0, 0, 0, 0]` gets nicely encoded as "trirori mulry".
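//!
//! The key property is that XOR with any index-derived mask is an involution: applying it twice gives back
//! the input, so a single function serves as both encoder and decoder. With a stand-in mask (Bunk uses a
//! precomputed 256-entry table instead):
//! ```
//! let mask = |index: usize| (index as u8).wrapping_mul(0x9D); // stand-in, not Bunk's table
//! let mix = |byte: u8, index: usize| byte ^ mask(index);
//! assert_eq!(mix(mix(0xC5, 3), 3), 0xC5); // mixing twice restores the byte
//! ```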

mod encode;
mod decode;
mod syllables;
mod serde;

pub use encode::*;
pub use decode::*;

#[cfg(feature = "serde")]
pub use serde::*;

/// Specifies the number of checksum bytes used when encoding.
///
/// Default: [`Checksum::Length1`].
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Checksum {
    /// No bytes used; the encoded data will not contain a checksum.
    Disabled,
    /// One byte used.
    Length1,
    /// Two bytes used.
    Length2,
    /// Three bytes used.
    Length3,
    /// Four bytes used.
    Length4,
}

impl Checksum {
    /// Returns the number of checksum bytes to be included in encoded data.
    const fn len(self) -> usize {
        self as usize
    }
}

impl Default for Checksum {
    fn default() -> Self {
        Checksum::Length1
    }
}

/// The FNV-1a hashing algorithm.
///
/// Implementation based on pseudo-code on
/// [Wikipedia](https://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function). This is used for the checksum.
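///
/// As a quick sanity check, hashing the single byte `b'a'` gives the published 32-bit FNV-1a test vector
/// `0xe40c292c`:
/// ```ignore
/// let mut hash = Fnv1a::new();
/// hash.update(b'a');
/// assert_eq!(hash.0, 0xe40c292c);
/// ```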
#[derive(Clone, Copy)]
struct Fnv1a(u32);

impl Fnv1a {
    /// Creates a hasher initialised with the FNV offset basis.
    const fn new() -> Fnv1a {
        Fnv1a(0x811c9dc5)
    }

    /// Digests one byte.
    fn update(&mut self, byte: u8) {
        self.0 ^= byte as u32;
        self.0 = self.0.wrapping_mul(0x01000193);
    }

    /// Returns the bytes to be used as checksum.
    const fn bytes(&self) -> [u8; 4] {
        self.0.to_le_bytes()
    }
}

/// Increases _apparent_ entropy in input data.
///
/// Before a byte is mapped to its syllable, it is run through this function together with its index to
/// reduce visible patterns in the input data. This ensures that e.g. `[0, 0, 0, 0]` gets encoded as
/// `trirori mul` and not `uuu u`.
///
/// Some notes:
/// - This neither increases nor decreases security; it is completely transparent, and used only to make the
///   output look nicer.
/// - The transformation applied to bytes repeats every 256 indices.
/// - This function undoes itself if the index is the same; i.e., it both encodes and decodes bytes.
///
/// ```ignore
/// let input = 0xC5;
/// let encoded = running_code(input, 0);
/// let decoded = running_code(encoded, 0);
/// assert_eq!(input, decoded);
/// ```
fn running_code(byte: u8, index: usize) -> u8 {
    const TABLE: [u8; 256] = include!("../static/entropy.txt");
    byte ^ TABLE[index & 0xFF]
}

#[cfg(test)]
mod tests {
    use rand::{rngs::SmallRng, RngCore, SeedableRng};
    use crate::*;

    fn round_trip(data: &[u8], settings: Settings) {
        let encoded = super::encode_with_settings(data, settings);
        let decoded = super::decode_with_settings(&encoded, settings.checksum);
        assert_eq!(decoded.as_deref(), Ok(data), "{data:?}, {settings:?}");
    }

    fn stress(n: usize) {
        let checksums = [
            Checksum::Disabled,
            Checksum::Length1,
            Checksum::Length2,
            Checksum::Length3,
            Checksum::Length4,
        ];
        let max_words = [None, Some(1), Some(2), Some(3), Some(10), Some(11)];
        let decorates = [true, false];
        let sizes = [0, 1, 2, 3, 10, 16, 30, 31, 32, 64, 100, 250, 509, 510];

        let stress_settings = |data: &[u8]| {
            for checksum in checksums {
                for max_word in max_words {
                    for decorate in decorates {
                        let settings = Settings {
                            checksum,
                            word_len: max_word,
                            decorate,
                        };
                        round_trip(data, settings);
                    }
                }
            }
        };
        let mut rng = SmallRng::seed_from_u64(7502546294857623797);

        for size in sizes {
            for _ in 0..n {
                let mut data = vec![0; size];
                rng.fill_bytes(&mut data);

                stress_settings(&data);
            }
        }
    }

    #[test]
    fn stress_medium() {
        stress(500);
    }
}