1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
//! Provides an implementation of the [canonical Huffman] encoder. //! //! Encoding is relatively easy since we are replacing the individual characters //! with the Huffman code. We add an EOS sign at the end to always fill the //! entire octet. //! //! The Huffman encoder implementation illustration: //! //! ```txt //! [add "!"] 1111111000 //! [add "$"] 11111110001111111111001 //! [add "%"] 11111110001111111111001010101 (fix length) //! [add "&"] 1111111000111111111100101010111111000 //! [add "A"] 1111111000111111111100101010111111000100001 //! [add EOS] 1111111000111111111100101010111111000100001111111111111111111111111111111 //! //! [result] [254 ][63 ][242 ][175 ][196 ][63 ] //! 111111100011111111110010101011111100010000111111 //! ``` //! //! The illustration shows how the encoder iterates through all the [ASCII] //! characters and replaces them with the Huffman code. Each line ends with the //! EOS character which serves as (up to 7 bits) padding. //! //! While adding the Hoffman code to the sequence, the length of the added code //! must exactly match the number of bits specified in the documentation. //! Working with Huffman codes in bytes and then converting them to other types, //! such as strings, could remove the prepended zeros. In such cases, we have to //! do some plumbing to ensure all bits are there (an example of this would be //! the character "%"). //! //! Implementation could be achieved by manipulating a string of ones and zeros. //! However, for more complex systems such as high-performance web servers, this //! would not be sustainable from the performance perspective. To manage //! resources accordingly, we require innovation so the investments are //! protected. //! //! A replacement of the string with characters such as numbers, which are more //! appropriate for computers, and the use of bitwise operators gives a //! significant increase in performance. Before this can be done, we need to //! have an understanding of how the numbers are added. Although we are all //! aware of what "1+2=3" is, or what is a concatenation of a string such as //! "aa+bb=aabb", in bit operations, these rules are not quite so obvious. Let's //! see an example of the addition with bits directly: //! //! ```txt //! 1 + 2 = 3 //! 00000001 | 00000010 = 00000011 //! ``` //! //! For the sum of two bit numbers, we used the bitwise operator OR denoted by //! the "|" symbol which serves as a sign for addition "+" in our example. Its //! rule is to trace the bits of both numbers and, if a 0 or a 1 is found on the //! same spot, change their value to 1, while setting the value to 0 in other //! cases. This understanding now enables us to re-implement the example above. //! //! Instead of a string, we will use a u64 data type storing a string of 64 //! bits. We could also use a data type with a larger capacity (such as u128), //! but u64 is sufficient. The storage requirement is 32 bits, which is the //! maximum length of the individual Huffman code plus an extra byte (8) for the //! surplus cell, meaning that we need 40 bits of storage altogether. //! //! The illustration below shows individual steps for encoding a string of //! characters as in the example above, while the encoding is carried out with //! the use of numbers and bitwise operators. //! //! ```txt //! [add "!"] 111111100000000000000000000000000000000000000000 //! [add "$"] 11111110001111111111001000000000000000000000000000000000 //! [add "%"] 1111111000111111111100101010100000000000000000000000000000000000 (fix length) //! [add "&"] 11111111110010101011111100000000000000000000000000000000000000 //! [add "A"] 1111001010101111110001000010000000000000000000000000000000000000 //! [add EOS] 1111001010101111110001000011111111111111111111111111111110000000 //! //! [result] [254 ][63 ][242 ][175 ][196 ][63 ] //! 111111100011111111110010101011111100010000111111 //! ``` //! //! Although the illustration is quite similar to the previous one, it is much //! more colorful. It is also apparent that a string of bits is getting shorter //! on the left and longer on the right end. //! //! When the Huffman code is added to the string for the individual character, //! the algorithm immediately ensures 32 free bit spaces where the next //! character will be added. This is achieved by the so-called shifting bits //! using the "<<" bitwise operator. Since we are dealing with bit numbers, we //! always rotate for 1 or more bytes, dependent on the required capacity, //! meaning for 8*N bits. It might not be obvious but it is interesting that, //! by rotating bits and adding the new Huffman character, we are adding numbers //! in the same way as we did in the simple example previously presented. //! //! If looked at separately, the Huffman algorithm is quite simple. But when we //! don't intend to only implement it, but we are, instead, interested in the //! maximization of the performance and lowering of used resources, things get //! more complicated. The performance and quality of our solution are, //! therefore, comparable to the implementation in some well-known web servers. //! //! [HPACK]: https://tools.ietf.org/html/rfc7541 //! [HTTP/2]: https://tools.ietf.org/html/rfc7540 //! [canonical Huffman]: https://en.wikipedia.org/wiki/Canonical_Huffman_code mod error; pub mod table; pub use error::*; /// Encodes the provided `src` bytes and populates the `dst` with the sequance /// of Huffman codes. /// /// **Example:** /// /// ```rust /// use httlib_huffman::encode; /// /// let mut sequence = Vec::new(); /// let text = "Hello world!".as_bytes(); /// encode(&text, &mut sequence).unwrap(); /// ``` pub fn encode(src: &[u8], dst: &mut Vec<u8>) -> Result<(), EncoderError> { let mut bits: u64 = 0; let mut bits_left = 40; let codings = self::table::ENCODE_TABLE; // parsed huffman table for &byte in src { let (code_len, code) = match codings.get(byte as usize) { Some(coding) => coding, None => return Err(EncoderError::InvalidInput), }; bits |= (*code as u64) << (bits_left - code_len); // shift and add old and new numbers bits_left -= code_len; while bits_left <= 32 { dst.push((bits >> 32) as u8); bits <<= 8; // add more room for the next character bits_left += 8; } } if bits_left != 40 { // finalize with EOS bits |= (1 << bits_left) - 1; // add EOS and pedding dst.push((bits >> 32) as u8); } Ok(()) } #[cfg(test)] mod test { use super::*; /// Should encode ASCII character into Huffman format. #[test] fn encodes_bytes() { let mut dst = Vec::new(); encode(b" ", &mut dst).unwrap(); encode(b"!", &mut dst).unwrap(); encode(b"\"", &mut dst).unwrap(); encode(b"#", &mut dst).unwrap(); encode(b"$", &mut dst).unwrap(); encode(b"%", &mut dst).unwrap(); encode(b"&", &mut dst).unwrap(); encode(b"'", &mut dst).unwrap(); encode(b"(", &mut dst).unwrap(); encode(b")", &mut dst).unwrap(); encode(b"*", &mut dst).unwrap(); encode(b"+", &mut dst).unwrap(); encode(b",", &mut dst).unwrap(); encode(b"-", &mut dst).unwrap(); encode(b".", &mut dst).unwrap(); encode(b"/", &mut dst).unwrap(); encode(b"0", &mut dst).unwrap(); encode(b"1", &mut dst).unwrap(); encode(b"2", &mut dst).unwrap(); encode(b"3", &mut dst).unwrap(); encode(b"4", &mut dst).unwrap(); encode(b"5", &mut dst).unwrap(); encode(b"6", &mut dst).unwrap(); encode(b"7", &mut dst).unwrap(); encode(b"8", &mut dst).unwrap(); encode(b"9", &mut dst).unwrap(); encode(b":", &mut dst).unwrap(); encode(b";", &mut dst).unwrap(); encode(b"<", &mut dst).unwrap(); encode(b"=", &mut dst).unwrap(); encode(b">", &mut dst).unwrap(); encode(b"?", &mut dst).unwrap(); encode(b"@", &mut dst).unwrap(); encode(b"A", &mut dst).unwrap(); encode(b"B", &mut dst).unwrap(); encode(b"C", &mut dst).unwrap(); encode(b"D", &mut dst).unwrap(); encode(b"E", &mut dst).unwrap(); encode(b"F", &mut dst).unwrap(); encode(b"G", &mut dst).unwrap(); encode(b"H", &mut dst).unwrap(); encode(b"I", &mut dst).unwrap(); encode(b"J", &mut dst).unwrap(); encode(b"K", &mut dst).unwrap(); encode(b"L", &mut dst).unwrap(); encode(b"M", &mut dst).unwrap(); encode(b"N", &mut dst).unwrap(); encode(b"O", &mut dst).unwrap(); encode(b"P", &mut dst).unwrap(); encode(b"Q", &mut dst).unwrap(); encode(b"R", &mut dst).unwrap(); encode(b"S", &mut dst).unwrap(); encode(b"T", &mut dst).unwrap(); encode(b"U", &mut dst).unwrap(); encode(b"V", &mut dst).unwrap(); encode(b"W", &mut dst).unwrap(); encode(b"X", &mut dst).unwrap(); encode(b"Y", &mut dst).unwrap(); encode(b"Z", &mut dst).unwrap(); encode(b"[", &mut dst).unwrap(); encode(b"\\", &mut dst).unwrap(); encode(b"]", &mut dst).unwrap(); encode(b"^", &mut dst).unwrap(); encode(b"_", &mut dst).unwrap(); encode(b"`", &mut dst).unwrap(); encode(b"a", &mut dst).unwrap(); encode(b"b", &mut dst).unwrap(); encode(b"c", &mut dst).unwrap(); encode(b"d", &mut dst).unwrap(); encode(b"e", &mut dst).unwrap(); encode(b"f", &mut dst).unwrap(); encode(b"g", &mut dst).unwrap(); encode(b"h", &mut dst).unwrap(); encode(b"i", &mut dst).unwrap(); encode(b"j", &mut dst).unwrap(); encode(b"k", &mut dst).unwrap(); encode(b"l", &mut dst).unwrap(); encode(b"m", &mut dst).unwrap(); encode(b"n", &mut dst).unwrap(); encode(b"o", &mut dst).unwrap(); encode(b"p", &mut dst).unwrap(); encode(b"q", &mut dst).unwrap(); encode(b"r", &mut dst).unwrap(); encode(b"s", &mut dst).unwrap(); encode(b"t", &mut dst).unwrap(); encode(b"u", &mut dst).unwrap(); encode(b"v", &mut dst).unwrap(); encode(b"w", &mut dst).unwrap(); encode(b"x", &mut dst).unwrap(); encode(b"y", &mut dst).unwrap(); encode(b"z", &mut dst).unwrap(); encode(b"{", &mut dst).unwrap(); encode(b"|", &mut dst).unwrap(); encode(b"}", &mut dst).unwrap(); encode(b"~", &mut dst).unwrap(); assert_eq!(dst, vec![ 83, 254, 63, 254, 127, 255, 175, 255, 207, 87, 248, 255, 95, 254, 191, 254, 255, 249, 255, 127, 250, 91, 95, 99, 7, 15, 23, 103, 107, 111, 115, 119, 123, 127, 185, 251, 255, 249, 131, 255, 191, 255, 63, 255, 215, 135, 187, 189, 191, 193, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 252, 231, 253, 255, 223, 255, 254, 31, 255, 231, 255, 243, 139, 255, 251, 31, 143, 39, 147, 47, 151, 155, 159, 55, 233, 235, 163, 167, 171, 63, 175, 237, 179, 71, 79, 183, 239, 241, 243, 245, 247, 255, 253, 255, 159, 255, 247, 255, 239, ]); dst.clear(); encode(b"Hello world!", &mut dst).unwrap(); assert_eq!(dst, &[ 198, 90, 40, 58, 158, 15, 101, 18, 127, 31, ]); } }