bunk/
lib.rs

//! Fast and efficient human-readable data encoding!
//!
//! Bunk encodes binary data as pronounceable gibberish, somewhat resembling Latin. This is useful when
//! binary data such as an encryption key is shown to an end-user who might need to manually transfer it.
//!
//! Using the default [settings](Settings), a string of 32 bytes gets encoded as:
//! ```text
//! atemorni telphocom neideu gepypi forzamar oasal cevanal butthepo aujoate turviy menkais
//! ```
//!
//! Optionally, Bunk can [decorate](Settings::decorate) the encoded string with commas, periods, and sentence
//! casing to improve readability:
//! ```text
//! Atemorni telphocom. Neideu gepypi forzamar oasal cevanal butthepo aujoate turviy, menkais.
//! ```
//!
//!
//! # Overview
//!
//! - It is fast! On my machine, encoding and then decoding a random array of 32 bytes takes an average of
//! ~0.8µs with the default settings --- allocations and all; no hidden fees.
//! - It is small! Bunk stores a table of only 256 syllables, each between 1 and 4 letters (average of 2.47),
//! and some data structures needed for fast lookup.
//! - Checksums of variable length can be added to encoded messages to verify data integrity when decoding,
//! which protects against typos.
//! - The [maximum word length](Settings::word_len) (in syllables) can be customized.
//!
//!
//! # How it compares to English dictionary encodings
//!
//! A popular scheme is to encode binary data as actual English words, which yields results that are more
//! readable and easier to remember. See [bip39](https://docs.rs/tiny-bip39/) as an example of this. However,
//! to be efficient in the amount of data a string of words can encode, a _massive_ table of (sometimes
//! quite long) words must be included --- [bip39](https://docs.rs/tiny-bip39/) uses 2048 words. In addition
//! to this, some kind of data structure for lookup is also needed, and will likely have to be constructed at
//! runtime. If this is no object to your application, use something like
//! [bip39](https://docs.rs/tiny-bip39/) instead!
//!
//! Bunk takes a different approach, requiring a table of only 256 syllables of 1-4 letters, each carrying one
//! byte of data. This allows Bunk to:
//! - Take up less memory overall.
//! - Store the data structures needed for fast lookup in static memory instead of having to construct them at
//! runtime.
//!
//!
//! # Serde
//!
//! Enable the `serde` feature and Bunk can be used to serialize/deserialize fields that implement
//! `AsRef<[u8]>` and `From<Vec<u8>>`:
//! ```text
//! #[derive(Serialize, Deserialize)]
//! struct Vault {
//!     #[serde(with = "bunk")]
//!     key: Vec<u8>,
//!     name: String,
//! }
//! ```
//!
//! Note that the [settings](Settings) used when encoding for serde are necessarily hard-coded:
//! ```no_run
//! # use bunk::*;
//! # let _ =
//! Settings {
//!     word_len: Some(3),
//!     checksum: Checksum::Disabled,
//!     decorate: false,
//! }
//! # ;
//! ```
//!
//!
//! # Examples
//!
//! Basic usage with default [settings](Settings):
//! ```
//! let encoded = bunk::encode(b"aftersun");
//! let decoded = bunk::decode(encoded)?;
//!
//! assert_eq!(decoded, b"aftersun");
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Disabled [checksum](Checksum):
//! ```
//! use bunk::{Checksum, Settings};
//!
//! let settings = Settings {
//!     checksum: Checksum::Disabled,
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings(b"it's such a beautiful day", settings);
//! let decoded = bunk::decode_with_settings(encoded, settings.checksum)?;
//!
//! assert_eq!(decoded, b"it's such a beautiful day");
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Custom [checksum length](Checksum):
//! ```
//! use bunk::{Checksum, Settings};
//!
//! let settings = Settings {
//!     checksum: Checksum::Length4,
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings([33, 14, 224, 134], settings);
//! let decoded = bunk::decode_with_settings(encoded, settings.checksum)?;
//!
//! assert_eq!(decoded, [33, 14, 224, 134]);
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//! Custom [word length limit](Settings::word_len):
//! ```
//! use bunk::Settings;
//!
//! let settings = Settings {
//!     word_len: Some(5),
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings([231, 6, 39, 34], settings);
//! let decoded = bunk::decode(encoded)?; // word_len doesn't affect the decoder
//!
//! assert_eq!(decoded, [231, 6, 39, 34]);
//! # Ok::<(), bunk::InvalidData>(())
//! ```
//!
//!
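//!
//! [Decoration](Settings::decorate) enabled (the decoder accepts the added punctuation and casing, so the
//! round trip still succeeds):
//! ```
//! use bunk::Settings;
//!
//! let settings = Settings {
//!     decorate: true,
//!     ..Default::default()
//! };
//! let encoded = bunk::encode_with_settings(b"aftersun", settings);
//! let decoded = bunk::decode_with_settings(encoded, settings.checksum)?;
//!
//! assert_eq!(decoded, b"aftersun");
//! # Ok::<(), bunk::InvalidData>(())
//! ```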
//! # How it works
//!
//! To explain the algorithm, we'll iteratively build upon it and solve issues as we go.
//!
//! The fundamental idea is to encode a byte as a syllable by using it to index into a table of 256 unique
//! syllables, the result of which is then appended to the encoded string --- as one would expect. The
//! decoder can then use a [trie](https://en.wikipedia.org/wiki/Trie) to find the index of the longest
//! syllable at the beginning of the string, which corresponds to the encoded byte.
//!
//! This by itself causes issues of parser ambiguity when one valid syllable is a prefix of another. Take as
//! a basic example the encoded string "ous". Is this the single syllable "ous", or the syllable "o" followed
//! by "us"? Barring some cumbersome machinery, there is no way for the decoder to know! The encoder
//! therefore has to detect when such an ambiguity is possible by checking if the first letter of the second
//! syllable is a valid continuation of the first syllable. If so, it inserts a word break between them.
//! (Technically, this is stricter than necessary for breaking the ambiguity but is easy to check and allows
//! the decoder to be written greedily.)
//!
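//! To make the ambiguity concrete, here is a toy greedy longest-match lookup (using a made-up
//! three-syllable table, not Bunk's real one):
//! ```
//! fn longest_syllable<'a>(s: &str, table: &[&'a str]) -> Option<&'a str> {
//!     table.iter().copied()
//!         .filter(|syllable| s.starts_with(syllable))
//!         .max_by_key(|syllable| syllable.len())
//! }
//!
//! let table = ["o", "us", "ous"];
//! // Greedy matching reads the whole string as one syllable...
//! assert_eq!(longest_syllable("ous", &table), Some("ous"));
//! // ...so an encoder meaning "o" then "us" must emit a break: "o us".
//! assert_eq!(longest_syllable("o us", &table), Some("o"));
//! ```
//!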
//! To support these two required operations --- finding the longest syllable prefixed to a string, and
//! checking whether a letter is a valid continuation of a syllable --- Bunk uses a trie. Two issues then
//! present themselves:
//! - Tries are _slow_ to construct.
//! - There are (somehow) no efficient trie libraries for Rust that allow for these operations in their API.
//!
//! As a solution to both, a precomputed trie (as created by [crawdad](https://docs.rs/crawdad/)) is
//! stored in static memory, on top of which Bunk implements a basic traversal --- the only API needed for
//! the two operations. All in all, the trie API comes out to only about 60 lines of code --- much less than
//! having to add [crawdad](https://docs.rs/crawdad/) (or such) as a dependency.
//!
//! So far, the algorithm we've described is a perfectly functional encoder. However, to be more
//! user-friendly, we'd ideally also like _all_ inputs to yield equally pronounceable text. Without any
//! further measures, inputs such as `[0, 0, 0, 0]` yield repeated syllables, in this case "uuu u". To avoid
//! this, Bunk artificially increases the _apparent_ entropy of encoded bytes by first XORing them with a
//! value dependent on their index. Since XOR undoes itself, the decoder can then do the exact same thing and
//! retrieve the original bytes. With this in place, `[0, 0, 0, 0]` gets nicely encoded as "trirori mulry".
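//!
//! Because XOR is its own inverse, the mixing step needs no separate decoding logic. A minimal standalone
//! sketch (using a made-up two-entry table instead of Bunk's real 256-entry one):
//! ```
//! const TABLE: [u8; 2] = [0x5A, 0xA3];
//!
//! fn mix(byte: u8, index: usize) -> u8 {
//!     byte ^ TABLE[index % TABLE.len()]
//! }
//!
//! // Mixing twice with the same index restores the original byte.
//! let mixed = mix(0x00, 1);
//! assert_eq!(mix(mixed, 1), 0x00);
//! ```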

mod encode;
mod decode;
mod syllables;
mod serde;

pub use encode::*;
pub use decode::*;

#[cfg(feature = "serde")]
pub use serde::*;

/// Specifies the number of checksum bytes used when encoding.
///
/// Default: [`Checksum::Length1`].
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Checksum {
    /// No bytes used; the encoded data will not contain a checksum.
    Disabled,
    /// One byte used.
    Length1,
    /// Two bytes used.
    Length2,
    /// Three bytes used.
    Length3,
    /// Four bytes used.
    Length4,
}

impl Checksum {
    /// Returns the number of checksum bytes to be included in encoded data.
    const fn len(self) -> usize {
        self as usize
    }
}

impl Default for Checksum {
    fn default() -> Self {
        Checksum::Length1
    }
}

/// The FNV-1a hashing algorithm.
///
/// Implementation based on the pseudo-code on
/// [Wikipedia](https://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function). This is used for the checksum.
#[derive(Clone, Copy)]
struct Fnv1a(u32);

impl Fnv1a {
    /// Creates a hasher initialised with the FNV offset basis.
    const fn new() -> Fnv1a {
        Fnv1a(0x811c9dc5)
    }

    /// Digests one byte.
    fn update(&mut self, byte: u8) {
        self.0 ^= byte as u32;
        self.0 = self.0.wrapping_mul(0x01000193);
    }

    /// Returns the bytes to be used as the checksum.
    const fn bytes(&self) -> [u8; 4] {
        self.0.to_le_bytes()
    }
}

/// Increases the _apparent_ entropy of input data.
///
/// Before getting the syllable corresponding to a byte, the byte, along with its index, is run through this
/// function to reduce visible patterns in the input data. This ensures that e.g. `[0, 0, 0, 0]` gets encoded
/// as `trirori mul` and not `uuu u`.
///
/// Some notes:
/// - This neither increases nor decreases security; it is completely transparent, and used only to make the
/// output look nicer.
/// - The transformation applied to bytes repeats every 256 indices.
/// - This function undoes itself if the index is the same; i.e., it both encodes and decodes bytes.
///
/// ```ignore
/// let input = 0xC5;
/// let encoded = running_code(input, 0);
/// let decoded = running_code(encoded, 0);
/// assert_eq!(input, decoded);
/// ```
fn running_code(byte: u8, index: usize) -> u8 {
    const TABLE: [u8; 256] = include!("../static/entropy.txt");
    byte ^ TABLE[index & 0xFF]
}

#[cfg(test)]
mod tests {
    use rand::{rngs::SmallRng, RngCore, SeedableRng};
    use crate::*;

    fn round_trip(data: &[u8], settings: Settings) {
        let encoded = super::encode_with_settings(data, settings);
        let decoded = super::decode_with_settings(&encoded, settings.checksum);
        assert_eq!(decoded.as_deref(), Ok(data), "{data:?}, {settings:?}");
    }

    fn stress(n: usize) {
        let checksums = [
            Checksum::Disabled,
            Checksum::Length1,
            Checksum::Length2,
            Checksum::Length3,
            Checksum::Length4,
        ];
        let max_words = [None, Some(1), Some(2), Some(3), Some(10), Some(11)];
        let decorates = [true, false];
        let sizes = [0, 1, 2, 3, 10, 16, 30, 31, 32, 64, 100, 250, 509, 510];

        let stress_settings = |data: &[u8]| {
            for checksum in checksums {
                for max_word in max_words {
                    for decorate in decorates {
                        let settings = Settings {
                            checksum,
                            word_len: max_word,
                            decorate,
                        };
                        round_trip(data, settings);
                    }
                }
            }
        };
        let mut rng = SmallRng::seed_from_u64(7502546294857623797);

        for size in sizes {
            for _ in 0..n {
                let mut data = vec![0; size];
                rng.fill_bytes(&mut data);

                stress_settings(&data);
            }
        }
    }

    #[test]
    fn stress_medium() {
        stress(500);
    }
}