httlib_huffman/encoder/mod.rs
1//! Provides an implementation of the [canonical Huffman] encoder.
2//!
3//! Encoding is relatively easy since we are replacing the individual characters
4//! with the Huffman code. We add an EOS sign at the end to always fill the
5//! entire octet.
6//!
7//! The Huffman encoder implementation illustration:
8//!
9//! ```txt
10//! [add "!"] 1111111000
11//! [add "$"] 11111110001111111111001
12//! [add "%"] 11111110001111111111001010101 (fix length)
13//! [add "&"] 1111111000111111111100101010111111000
14//! [add "A"] 1111111000111111111100101010111111000100001
15//! [add EOS] 1111111000111111111100101010111111000100001111111111111111111111111111111
16//!
17//! [result] [254 ][63 ][242 ][175 ][196 ][63 ]
18//! 111111100011111111110010101011111100010000111111
19//! ```
20//!
21//! The illustration shows how the encoder iterates through all the [ASCII]
22//! characters and replaces them with the Huffman code. Each line ends with the
23//! EOS character which serves as (up to 7 bits) padding.
24//!
25//! While adding the Hoffman code to the sequence, the length of the added code
26//! must exactly match the number of bits specified in the documentation.
27//! Working with Huffman codes in bytes and then converting them to other types,
28//! such as strings, could remove the prepended zeros. In such cases, we have to
29//! do some plumbing to ensure all bits are there (an example of this would be
30//! the character "%").
31//!
32//! Implementation could be achieved by manipulating a string of ones and zeros.
33//! However, for more complex systems such as high-performance web servers, this
34//! would not be sustainable from the performance perspective. To manage
35//! resources accordingly, we require innovation so the investments are
36//! protected.
37//!
38//! A replacement of the string with characters such as numbers, which are more
39//! appropriate for computers, and the use of bitwise operators gives a
40//! significant increase in performance. Before this can be done, we need to
41//! have an understanding of how the numbers are added. Although we are all
42//! aware of what "1+2=3" is, or what is a concatenation of a string such as
43//! "aa+bb=aabb", in bit operations, these rules are not quite so obvious. Let's
44//! see an example of the addition with bits directly:
45//!
46//! ```txt
47//! 1 + 2 = 3
48//! 00000001 | 00000010 = 00000011
49//! ```
50//!
51//! For the sum of two bit numbers, we used the bitwise operator OR denoted by
52//! the "|" symbol which serves as a sign for addition "+" in our example. Its
53//! rule is to trace the bits of both numbers and, if a 0 or a 1 is found on the
54//! same spot, change their value to 1, while setting the value to 0 in other
55//! cases. This understanding now enables us to re-implement the example above.
56//!
57//! Instead of a string, we will use a u64 data type storing a string of 64
58//! bits. We could also use a data type with a larger capacity (such as u128),
59//! but u64 is sufficient. The storage requirement is 32 bits, which is the
60//! maximum length of the individual Huffman code plus an extra byte (8) for the
61//! surplus cell, meaning that we need 40 bits of storage altogether.
62//!
63//! The illustration below shows individual steps for encoding a string of
64//! characters as in the example above, while the encoding is carried out with
65//! the use of numbers and bitwise operators.
66//!
67//! ```txt
68//! [add "!"] 111111100000000000000000000000000000000000000000
69//! [add "$"] 11111110001111111111001000000000000000000000000000000000
70//! [add "%"] 1111111000111111111100101010100000000000000000000000000000000000 (fix length)
71//! [add "&"] 11111111110010101011111100000000000000000000000000000000000000
72//! [add "A"] 1111001010101111110001000010000000000000000000000000000000000000
73//! [add EOS] 1111001010101111110001000011111111111111111111111111111110000000
74//!
75//! [result] [254 ][63 ][242 ][175 ][196 ][63 ]
76//! 111111100011111111110010101011111100010000111111
77//! ```
78//!
79//! Although the illustration is quite similar to the previous one, it is much
80//! more colorful. It is also apparent that a string of bits is getting shorter
81//! on the left and longer on the right end.
82//!
83//! When the Huffman code is added to the string for the individual character,
84//! the algorithm immediately ensures 32 free bit spaces where the next
85//! character will be added. This is achieved by the so-called shifting bits
86//! using the "<<" bitwise operator. Since we are dealing with bit numbers, we
87//! always rotate for 1 or more bytes, dependent on the required capacity,
88//! meaning for 8*N bits. It might not be obvious but it is interesting that,
89//! by rotating bits and adding the new Huffman character, we are adding numbers
90//! in the same way as we did in the simple example previously presented.
91//!
92//! If looked at separately, the Huffman algorithm is quite simple. But when we
93//! don't intend to only implement it, but we are, instead, interested in the
94//! maximization of the performance and lowering of used resources, things get
95//! more complicated. The performance and quality of our solution are,
96//! therefore, comparable to the implementation in some well-known web servers.
97//!
98//! [HPACK]: https://tools.ietf.org/html/rfc7541
99//! [HTTP/2]: https://tools.ietf.org/html/rfc7540
100//! [canonical Huffman]: https://en.wikipedia.org/wiki/Canonical_Huffman_code
101mod error;
102pub mod table;
103
104pub use error::*;
105
106/// Encodes the provided `src` bytes and populates the `dst` with the sequance
107/// of Huffman codes.
108///
109/// **Example:**
110///
111/// ```rust
112/// use httlib_huffman::encode;
113///
114/// let mut sequence = Vec::new();
115/// let text = "Hello world!".as_bytes();
116/// encode(&text, &mut sequence).unwrap();
117/// ```
118pub fn encode(src: &[u8], dst: &mut Vec<u8>) -> Result<(), EncoderError> {
119 let mut bits: u64 = 0;
120 let mut bits_left = 40;
121 let codings = self::table::ENCODE_TABLE; // parsed huffman table
122
123 for &byte in src {
124 let (code_len, code) = match codings.get(byte as usize) {
125 Some(coding) => coding,
126 None => return Err(EncoderError::InvalidInput),
127 };
128
129 bits |= (*code as u64) << (bits_left - code_len); // shift and add old and new numbers
130 bits_left -= code_len;
131
132 while bits_left <= 32 {
133 dst.push((bits >> 32) as u8);
134
135 bits <<= 8; // add more room for the next character
136 bits_left += 8;
137 }
138 }
139
140 if bits_left != 40 { // finalize with EOS
141 bits |= (1 << bits_left) - 1; // add EOS and pedding
142 dst.push((bits >> 32) as u8);
143 }
144
145 Ok(())
146}
147
148#[cfg(test)]
149mod test {
150 use super::*;
151
152 /// Should encode ASCII character into Huffman format.
153 #[test]
154 fn encodes_bytes() {
155 let mut dst = Vec::new();
156
157 encode(b" ", &mut dst).unwrap();
158 encode(b"!", &mut dst).unwrap();
159 encode(b"\"", &mut dst).unwrap();
160 encode(b"#", &mut dst).unwrap();
161 encode(b"$", &mut dst).unwrap();
162 encode(b"%", &mut dst).unwrap();
163 encode(b"&", &mut dst).unwrap();
164 encode(b"'", &mut dst).unwrap();
165 encode(b"(", &mut dst).unwrap();
166 encode(b")", &mut dst).unwrap();
167 encode(b"*", &mut dst).unwrap();
168 encode(b"+", &mut dst).unwrap();
169 encode(b",", &mut dst).unwrap();
170 encode(b"-", &mut dst).unwrap();
171 encode(b".", &mut dst).unwrap();
172 encode(b"/", &mut dst).unwrap();
173 encode(b"0", &mut dst).unwrap();
174 encode(b"1", &mut dst).unwrap();
175 encode(b"2", &mut dst).unwrap();
176 encode(b"3", &mut dst).unwrap();
177 encode(b"4", &mut dst).unwrap();
178 encode(b"5", &mut dst).unwrap();
179 encode(b"6", &mut dst).unwrap();
180 encode(b"7", &mut dst).unwrap();
181 encode(b"8", &mut dst).unwrap();
182 encode(b"9", &mut dst).unwrap();
183 encode(b":", &mut dst).unwrap();
184 encode(b";", &mut dst).unwrap();
185 encode(b"<", &mut dst).unwrap();
186 encode(b"=", &mut dst).unwrap();
187 encode(b">", &mut dst).unwrap();
188 encode(b"?", &mut dst).unwrap();
189 encode(b"@", &mut dst).unwrap();
190 encode(b"A", &mut dst).unwrap();
191 encode(b"B", &mut dst).unwrap();
192 encode(b"C", &mut dst).unwrap();
193 encode(b"D", &mut dst).unwrap();
194 encode(b"E", &mut dst).unwrap();
195 encode(b"F", &mut dst).unwrap();
196 encode(b"G", &mut dst).unwrap();
197 encode(b"H", &mut dst).unwrap();
198 encode(b"I", &mut dst).unwrap();
199 encode(b"J", &mut dst).unwrap();
200 encode(b"K", &mut dst).unwrap();
201 encode(b"L", &mut dst).unwrap();
202 encode(b"M", &mut dst).unwrap();
203 encode(b"N", &mut dst).unwrap();
204 encode(b"O", &mut dst).unwrap();
205 encode(b"P", &mut dst).unwrap();
206 encode(b"Q", &mut dst).unwrap();
207 encode(b"R", &mut dst).unwrap();
208 encode(b"S", &mut dst).unwrap();
209 encode(b"T", &mut dst).unwrap();
210 encode(b"U", &mut dst).unwrap();
211 encode(b"V", &mut dst).unwrap();
212 encode(b"W", &mut dst).unwrap();
213 encode(b"X", &mut dst).unwrap();
214 encode(b"Y", &mut dst).unwrap();
215 encode(b"Z", &mut dst).unwrap();
216 encode(b"[", &mut dst).unwrap();
217 encode(b"\\", &mut dst).unwrap();
218 encode(b"]", &mut dst).unwrap();
219 encode(b"^", &mut dst).unwrap();
220 encode(b"_", &mut dst).unwrap();
221 encode(b"`", &mut dst).unwrap();
222 encode(b"a", &mut dst).unwrap();
223 encode(b"b", &mut dst).unwrap();
224 encode(b"c", &mut dst).unwrap();
225 encode(b"d", &mut dst).unwrap();
226 encode(b"e", &mut dst).unwrap();
227 encode(b"f", &mut dst).unwrap();
228 encode(b"g", &mut dst).unwrap();
229 encode(b"h", &mut dst).unwrap();
230 encode(b"i", &mut dst).unwrap();
231 encode(b"j", &mut dst).unwrap();
232 encode(b"k", &mut dst).unwrap();
233 encode(b"l", &mut dst).unwrap();
234 encode(b"m", &mut dst).unwrap();
235 encode(b"n", &mut dst).unwrap();
236 encode(b"o", &mut dst).unwrap();
237 encode(b"p", &mut dst).unwrap();
238 encode(b"q", &mut dst).unwrap();
239 encode(b"r", &mut dst).unwrap();
240 encode(b"s", &mut dst).unwrap();
241 encode(b"t", &mut dst).unwrap();
242 encode(b"u", &mut dst).unwrap();
243 encode(b"v", &mut dst).unwrap();
244 encode(b"w", &mut dst).unwrap();
245 encode(b"x", &mut dst).unwrap();
246 encode(b"y", &mut dst).unwrap();
247 encode(b"z", &mut dst).unwrap();
248 encode(b"{", &mut dst).unwrap();
249 encode(b"|", &mut dst).unwrap();
250 encode(b"}", &mut dst).unwrap();
251 encode(b"~", &mut dst).unwrap();
252 assert_eq!(dst, vec![
253 83, 254, 63, 254, 127, 255, 175, 255, 207, 87, 248, 255, 95, 254,
254 191, 254, 255, 249, 255, 127, 250, 91, 95, 99, 7, 15, 23, 103, 107,
255 111, 115, 119, 123, 127, 185, 251, 255, 249, 131, 255, 191, 255, 63,
256 255, 215, 135, 187, 189, 191, 193, 195, 197, 199, 201, 203, 205,
257 207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 252,
258 231, 253, 255, 223, 255, 254, 31, 255, 231, 255, 243, 139, 255, 251,
259 31, 143, 39, 147, 47, 151, 155, 159, 55, 233, 235, 163, 167, 171,
260 63, 175, 237, 179, 71, 79, 183, 239, 241, 243, 245, 247, 255, 253,
261 255, 159, 255, 247, 255, 239,
262 ]);
263
264 dst.clear();
265 encode(b"Hello world!", &mut dst).unwrap();
266 assert_eq!(dst, &[
267 198, 90, 40, 58, 158, 15, 101, 18, 127, 31,
268 ]);
269 }
270}