httlib_huffman/encoder/
mod.rs

1//! Provides an implementation of the [canonical Huffman] encoder.
2//! 
3//! Encoding is relatively easy since we are replacing the individual characters
4//! with the Huffman code. We add an EOS sign at the end to always fill the
5//! entire octet.
6//! 
7//! The Huffman encoder implementation illustration:
8//! 
9//! ```txt
10//! [add "!"]     1111111000
11//! [add "$"]     11111110001111111111001
12//! [add "%"]     11111110001111111111001010101 (fix length)
13//! [add "&"]     1111111000111111111100101010111111000
14//! [add "A"]     1111111000111111111100101010111111000100001
15//! [add EOS]     1111111000111111111100101010111111000100001111111111111111111111111111111
16//! 
17//! [result]      [254   ][63    ][242   ][175   ][196   ][63    ]
18//!               111111100011111111110010101011111100010000111111
19//! ```
20//! 
21//! The illustration shows how the encoder iterates through all the [ASCII]
22//! characters and replaces them with the Huffman code. Each line ends with the
23//! EOS character which serves as (up to 7 bits) padding.
24//! 
25//! While adding the Hoffman code to the sequence, the length of the added code
26//! must exactly match the number of bits specified in the documentation.
27//! Working with Huffman codes in bytes and then converting them to other types,
28//! such as strings, could remove the prepended zeros. In such cases, we have to
29//! do some plumbing to ensure all bits are there (an example of this would be
30//! the character "%").
31//! 
32//! Implementation could be achieved by manipulating a string of ones and zeros.
33//! However, for more complex systems such as high-performance web servers, this
34//! would not be sustainable from the performance perspective. To manage
35//! resources accordingly, we require innovation so the investments are
36//! protected.
37//! 
38//! A replacement of the string with characters such as numbers, which are more
39//! appropriate for computers, and the use of bitwise operators gives a
40//! significant increase in performance. Before this can be done, we need to
41//! have an understanding of how the numbers are added. Although we are all
42//! aware of what "1+2=3" is, or what is a concatenation of a string such as
43//! "aa+bb=aabb", in bit operations, these rules are not quite so obvious. Let's
44//! see an example of the addition with bits directly:
45//! 
46//! ```txt
47//!        1 +        2 =        3
48//! 00000001 | 00000010 = 00000011
49//! ```
50//! 
51//! For the sum of two bit numbers, we used the bitwise operator OR denoted by
52//! the "|" symbol which serves as a sign for addition "+" in our example. Its
53//! rule is to trace the bits of both numbers and, if a 0 or a 1 is found on the
54//! same spot, change their value to 1, while setting the value to 0 in other
55//! cases. This understanding now enables us to re-implement the example above.
56//! 
57//! Instead of a string, we will use a u64 data type storing a string of 64
58//! bits. We could also use a data type with a larger capacity (such as u128),
59//! but u64 is sufficient. The storage requirement is 32 bits, which is the
60//! maximum length of the individual Huffman code plus an extra byte (8) for the
61//! surplus cell, meaning that we need 40 bits of storage altogether.
62//! 
63//! The illustration below shows individual steps for encoding a string of
64//! characters as in the example above, while the encoding is carried out with
65//! the use of numbers and bitwise operators.
66//! 
67//! ```txt
68//! [add "!"]     111111100000000000000000000000000000000000000000
69//! [add "$"]     11111110001111111111001000000000000000000000000000000000
70//! [add "%"]     1111111000111111111100101010100000000000000000000000000000000000 (fix length)
71//! [add "&"]               11111111110010101011111100000000000000000000000000000000000000
72//! [add "A"]                     1111001010101111110001000010000000000000000000000000000000000000
73//! [add EOS]                     1111001010101111110001000011111111111111111111111111111110000000
74//! 
75//! [result]      [254   ][63    ][242   ][175   ][196   ][63    ]
76//!               111111100011111111110010101011111100010000111111
77//! ```
78//! 
79//! Although the illustration is quite similar to the previous one, it is much
80//! more colorful. It is also apparent that a string of bits is getting shorter
81//! on the left and longer on the right end.
82//! 
83//! When the Huffman code is added to the string for the individual character,
84//! the algorithm immediately ensures 32 free bit spaces where the next
85//! character will be added. This is achieved by the so-called shifting bits
86//! using the "<<" bitwise operator. Since we are dealing with bit numbers, we
87//! always rotate for 1 or more bytes, dependent on the required capacity,
88//! meaning for 8*N bits. It might not be obvious but it is interesting that,
89//! by rotating bits and adding the new Huffman character, we are adding numbers
90//! in the same way as we did in the simple example previously presented.
91//! 
92//! If looked at separately, the Huffman algorithm is quite simple. But when we
93//! don't intend to only implement it, but we are, instead, interested in the
94//! maximization of the performance and lowering of used resources, things get
95//! more complicated. The performance and quality of our solution are,
96//! therefore, comparable to the implementation in some well-known web servers.
97//! 
98//! [HPACK]: https://tools.ietf.org/html/rfc7541
99//! [HTTP/2]: https://tools.ietf.org/html/rfc7540
100//! [canonical Huffman]: https://en.wikipedia.org/wiki/Canonical_Huffman_code
101mod error;
102pub mod table;
103
104pub use error::*;
105
106/// Encodes the provided `src` bytes and populates the `dst` with the sequance
107/// of Huffman codes.
108/// 
109/// **Example:**
110/// 
111/// ```rust
112/// use httlib_huffman::encode;
113/// 
114/// let mut sequence = Vec::new();
115/// let text = "Hello world!".as_bytes();
116/// encode(&text, &mut sequence).unwrap();
117/// ```
118pub fn encode(src: &[u8], dst: &mut Vec<u8>) -> Result<(), EncoderError> {
119    let mut bits: u64 = 0;
120    let mut bits_left = 40;
121    let codings = self::table::ENCODE_TABLE; // parsed huffman table
122
123    for &byte in src {
124        let (code_len, code) = match codings.get(byte as usize) {
125            Some(coding) => coding,
126            None => return Err(EncoderError::InvalidInput),
127        };
128
129        bits |= (*code as u64) << (bits_left - code_len); // shift and add old and new numbers
130        bits_left -= code_len;
131
132        while bits_left <= 32 {
133            dst.push((bits >> 32) as u8);
134
135            bits <<= 8; // add more room for the next character
136            bits_left += 8;
137        }
138    }
139
140    if bits_left != 40 { // finalize with EOS
141        bits |= (1 << bits_left) - 1; // add EOS and pedding
142        dst.push((bits >> 32) as u8);
143    }
144
145    Ok(())
146}
147
148#[cfg(test)]
149mod test {
150    use super::*;
151
152    /// Should encode ASCII character into Huffman format.
153    #[test]
154    fn encodes_bytes() { 
155        let mut dst = Vec::new();
156
157        encode(b" ", &mut dst).unwrap();
158        encode(b"!", &mut dst).unwrap();
159        encode(b"\"", &mut dst).unwrap();
160        encode(b"#", &mut dst).unwrap();
161        encode(b"$", &mut dst).unwrap();
162        encode(b"%", &mut dst).unwrap();
163        encode(b"&", &mut dst).unwrap();
164        encode(b"'", &mut dst).unwrap();
165        encode(b"(", &mut dst).unwrap();
166        encode(b")", &mut dst).unwrap();
167        encode(b"*", &mut dst).unwrap();
168        encode(b"+", &mut dst).unwrap();
169        encode(b",", &mut dst).unwrap();
170        encode(b"-", &mut dst).unwrap();
171        encode(b".", &mut dst).unwrap();
172        encode(b"/", &mut dst).unwrap();
173        encode(b"0", &mut dst).unwrap();
174        encode(b"1", &mut dst).unwrap();
175        encode(b"2", &mut dst).unwrap();
176        encode(b"3", &mut dst).unwrap();
177        encode(b"4", &mut dst).unwrap();
178        encode(b"5", &mut dst).unwrap();
179        encode(b"6", &mut dst).unwrap();
180        encode(b"7", &mut dst).unwrap();
181        encode(b"8", &mut dst).unwrap();
182        encode(b"9", &mut dst).unwrap();
183        encode(b":", &mut dst).unwrap();
184        encode(b";", &mut dst).unwrap();
185        encode(b"<", &mut dst).unwrap();
186        encode(b"=", &mut dst).unwrap();
187        encode(b">", &mut dst).unwrap();
188        encode(b"?", &mut dst).unwrap();
189        encode(b"@", &mut dst).unwrap();
190        encode(b"A", &mut dst).unwrap();
191        encode(b"B", &mut dst).unwrap();
192        encode(b"C", &mut dst).unwrap();
193        encode(b"D", &mut dst).unwrap();
194        encode(b"E", &mut dst).unwrap();
195        encode(b"F", &mut dst).unwrap();
196        encode(b"G", &mut dst).unwrap();
197        encode(b"H", &mut dst).unwrap();
198        encode(b"I", &mut dst).unwrap();
199        encode(b"J", &mut dst).unwrap();
200        encode(b"K", &mut dst).unwrap();
201        encode(b"L", &mut dst).unwrap();
202        encode(b"M", &mut dst).unwrap();
203        encode(b"N", &mut dst).unwrap();
204        encode(b"O", &mut dst).unwrap();
205        encode(b"P", &mut dst).unwrap();
206        encode(b"Q", &mut dst).unwrap();
207        encode(b"R", &mut dst).unwrap();
208        encode(b"S", &mut dst).unwrap();
209        encode(b"T", &mut dst).unwrap();
210        encode(b"U", &mut dst).unwrap();
211        encode(b"V", &mut dst).unwrap();
212        encode(b"W", &mut dst).unwrap();
213        encode(b"X", &mut dst).unwrap();
214        encode(b"Y", &mut dst).unwrap();
215        encode(b"Z", &mut dst).unwrap();
216        encode(b"[", &mut dst).unwrap();
217        encode(b"\\", &mut dst).unwrap();
218        encode(b"]", &mut dst).unwrap();
219        encode(b"^", &mut dst).unwrap();
220        encode(b"_", &mut dst).unwrap();
221        encode(b"`", &mut dst).unwrap();
222        encode(b"a", &mut dst).unwrap();
223        encode(b"b", &mut dst).unwrap();
224        encode(b"c", &mut dst).unwrap();
225        encode(b"d", &mut dst).unwrap();
226        encode(b"e", &mut dst).unwrap();
227        encode(b"f", &mut dst).unwrap();
228        encode(b"g", &mut dst).unwrap();
229        encode(b"h", &mut dst).unwrap();
230        encode(b"i", &mut dst).unwrap();
231        encode(b"j", &mut dst).unwrap();
232        encode(b"k", &mut dst).unwrap();
233        encode(b"l", &mut dst).unwrap();
234        encode(b"m", &mut dst).unwrap();
235        encode(b"n", &mut dst).unwrap();
236        encode(b"o", &mut dst).unwrap();
237        encode(b"p", &mut dst).unwrap();
238        encode(b"q", &mut dst).unwrap();
239        encode(b"r", &mut dst).unwrap();
240        encode(b"s", &mut dst).unwrap();
241        encode(b"t", &mut dst).unwrap();
242        encode(b"u", &mut dst).unwrap();
243        encode(b"v", &mut dst).unwrap();
244        encode(b"w", &mut dst).unwrap();
245        encode(b"x", &mut dst).unwrap();
246        encode(b"y", &mut dst).unwrap();
247        encode(b"z", &mut dst).unwrap();
248        encode(b"{", &mut dst).unwrap();
249        encode(b"|", &mut dst).unwrap();
250        encode(b"}", &mut dst).unwrap();
251        encode(b"~", &mut dst).unwrap();
252        assert_eq!(dst, vec![
253            83, 254, 63, 254, 127, 255, 175, 255, 207, 87, 248, 255, 95, 254,
254            191, 254, 255, 249, 255, 127, 250, 91, 95, 99, 7, 15, 23, 103, 107,
255            111, 115, 119, 123, 127, 185, 251, 255, 249, 131, 255, 191, 255, 63,
256            255, 215, 135, 187, 189, 191, 193, 195, 197, 199, 201, 203, 205,
257            207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 252,
258            231, 253, 255, 223, 255, 254, 31, 255, 231, 255, 243, 139, 255, 251,
259            31, 143, 39, 147, 47, 151, 155, 159, 55, 233, 235, 163, 167, 171,
260            63, 175, 237, 179, 71, 79, 183, 239, 241, 243, 245, 247, 255, 253,
261            255, 159, 255, 247, 255, 239,
262        ]);
263
264        dst.clear();
265        encode(b"Hello world!", &mut dst).unwrap();
266        assert_eq!(dst, &[
267            198, 90, 40, 58, 158, 15, 101, 18, 127, 31,
268        ]);
269    }
270}