utf8-ranges 0.1.3

Convert ranges of Unicode codepoints to UTF-8 byte ranges.


This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte-based automata from Unicode. Stated differently, this lets one embed UTF-8 decoding as part of one's automaton.

Dual-licensed under MIT or the UNLICENSE.




This shows how to convert a scalar value range (e.g., the basic multilingual plane) to a sequence of byte-based character classes.

extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}

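The output:

[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]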

These ranges can then be used to build an automaton. Namely:

  1. Any sequence of bytes matches either exactly one of the sequences of ranges or none of them.
  2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate codepoints in UTF-8 cannot match any of the byte ranges above.)
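
The two properties above can be checked with a small sketch. It does not use this crate's API. Instead it hardcodes the byte ranges that the example produces for U+0000 to U+FFFF, so it runs without external dependencies. The names BMP_SEQUENCES and match_sequence are hypothetical, and the table is an assumption that should be checked against the crate's actual output.

```rust
/// Byte-range sequences covering U+0000..=U+FFFF, written as inclusive
/// (start, end) pairs. Assumption: hardcoded for illustration rather than
/// generated by the crate.
const BMP_SEQUENCES: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns the index of the sequence that `bytes` matches in full, if any.
/// (Hypothetical helper; a real automaton would consume bytes one at a time.)
fn match_sequence(bytes: &[u8]) -> Option<usize> {
    BMP_SEQUENCES.iter().position(|seq| {
        seq.len() == bytes.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

fn main() {
    // U+00E9 encodes as C3 A9 and matches the two-byte sequence.
    assert_eq!(match_sequence("é".as_bytes()), Some(1));
    // U+20AC encodes as E2 82 AC and matches a three-byte sequence.
    assert_eq!(match_sequence("€".as_bytes()), Some(3));
    // An erroneous encoding of the surrogate U+D800 matches nothing.
    assert_eq!(match_sequence(&[0xED, 0xA0, 0x80]), None);
    // An overlong encoding of '/' matches nothing.
    assert_eq!(match_sequence(&[0xC0, 0xAF]), None);
}
```

The linear scan is only for clarity. When building an automaton, each sequence would instead become a chain of transitions, one byte range per state.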