Skip to main content

Module utf8

Module utf8 

Source
Expand description

Utilities for encoding bytes/UTF-8 text into ring elements.

This module provides two related but distinct encodings for turning a byte string into a sequence of RingElements with a given modulus:

  • Bit-packing (encode_bytes / decode_bytes): Packs the input bitstream into chunks of b = floor(log2(modulus)) bits. Every produced element satisfies value < 2^b <= modulus, so some residues are intentionally unused when the modulus is not a power of two. This format does not include the original byte length.

  • Length-delimited base-m transduction (encode_bytes_base_m_len / decode_bytes_base_m_len): Encodes the byte length and a fixed rANS state as base-modulus digits, then converts the byte stream into base-modulus payload digits using a uniform rANS transducer. Every produced digit satisfies digit < modulus, so it uses all residues. Decoding is length-delimited and ignores trailing elements.

§Target behavior (base-m-len variant)

  • Uses all residues: every emitted RingElement satisfies value < modulus, with no “[0, 2^b)” restriction.
  • Near-linear time: amortized O(1) work per input byte, plus a constant-size header (2 * k digits for a fixed k determined by modulus).
  • Same public API: the public encode_bytes_base_m_len / decode_bytes_base_m_len signatures and encode_text_base_m_len / decode_text_base_m_len behavior are unchanged.
  • Padding tolerance: appending extra elements at the end (for example, block padding) does not affect decoding; the decoder stops after emitting the byte count from the length prefix.

§Design: uniform rANS as a radix transducer

Treat each input byte as a uniform symbol (frequency = 1 over 256 symbols). The symbol update is:

  • encode: x = x * 256 + byte
  • decode: byte = x % 256; x = x / 256

Renormalization uses base = modulus, so every emitted digit is a valid ring residue (digit < modulus). The stream stores a fixed-size initial state immediately after the length prefix so decoding can proceed FIFO (left-to-right) and ignore any trailing padding.

§Future work

Investigate non-uniform rANS as a radix transducer to incorporate empirical byte distributions while retaining the base-modulus digit stream interface.

The bitstream in encode_bytes is packed little-endian: earlier bytes occupy lower bits of the internal buffer and are emitted first.

Important limitations (bit-packing variant):

  • The encoding does not include the original byte length. For some parameter choices (notably when b > 8), decoding may yield extra trailing 0x00 bytes that come from zero-padding the final partial chunk. If you need exact round-trips, store the original length separately and truncate after decoding.
  • Decoding assumes elements are in the canonical range produced by this module (each element value fits in b bits). If you modify elements via ring arithmetic, decoding will generally not recover the original bytes.

§References (ANS / rANS)

Structs§

Utf8Encoding

Enums§

EncodingError
Errors that can occur while encoding or decoding.
Utf8EncodingType