Expand description
Utilities for encoding bytes/UTF-8 text into ring elements.
This module provides two related but distinct encodings for turning a byte
string into a sequence of RingElements with a given modulus:
-
Bit-packing (
encode_bytes/decode_bytes): Packs the input bitstream into chunks of b = floor(log2(modulus)) bits. Every produced element satisfies value < 2^b <= modulus, so some residues are intentionally unused when the modulus is not a power of two. This format does not include the original byte length. -
Length-delimited base-m transduction (
encode_bytes_base_m_len/decode_bytes_base_m_len): Encodes the byte length and a fixed rANS state as base-modulus digits, then converts the byte stream into base-modulus payload digits using a uniform rANS transducer. Every produced digit satisfies digit < modulus, so it uses all residues. Decoding is length-delimited and ignores trailing elements.
§Target behavior (base-m-len variant)
- Uses all residues: every emitted
RingElementsatisfies value < modulus, with no “[0, 2^b)” restriction. - Near-linear time: amortized O(1) work per input byte, plus a constant-size
header (2 * k digits for a fixed k determined by
modulus). - Same public API: the public
encode_bytes_base_m_len/decode_bytes_base_m_lensignatures andencode_text_base_m_len/decode_text_base_m_lenbehavior are unchanged. - Padding tolerance: appending extra elements at the end (for example, block padding) does not affect decoding; the decoder stops after emitting the byte count from the length prefix.
§Design: uniform rANS as a radix transducer
Treat each input byte as a uniform symbol (frequency = 1 over 256 symbols). The symbol update is:
- encode: x = x * 256 + byte
- decode: byte = x % 256; x = x / 256
Renormalization uses base = modulus, so every emitted digit is a valid ring residue (digit < modulus). The stream stores a fixed-size initial state immediately after the length prefix so decoding can proceed FIFO (left-to-right) and ignore any trailing padding.
§Future work
Investigate non-uniform rANS as a radix transducer to incorporate empirical byte distributions while retaining the base-modulus digit stream interface.
The bitstream in encode_bytes is packed little-endian: earlier bytes
occupy lower bits of the internal buffer and are emitted first.
Important limitations (bit-packing variant):
- The encoding does not include the original byte length. For some parameter choices (notably when b > 8), decoding may yield extra trailing 0x00 bytes that come from zero-padding the final partial chunk. If you need exact round-trips, store the original length separately and truncate after decoding.
- Decoding assumes elements are in the canonical range produced by this module (each element value fits in b bits). If you modify elements via ring arithmetic, decoding will generally not recover the original bytes.
§References (ANS / rANS)
- J. Duda, “Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding”, https://arxiv.org/abs/1311.2540
- F. Giesen, “rANS notes”: https://fgiesen.wordpress.com/2014/02/02/rans-notes/
- F. Giesen, “rANS with static probability distributions”: https://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/
- Y. Collet, “Finite State Entropy, a new breed of entropy coders” (FSE/tANS): https://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
Structs§
Enums§
- Encoding
Error - Errors that can occur while encoding or decoding.
- Utf8
Encoding Type