
Implementation of UTF8 / UTF32 converters and converting iterators, including the supporting recognition and translation functions.

Works on a single buffer as well as multiple buffers without needing heap allocation.

An invalid sequence encountered during decoding is replaced with the Unicode replacement character (U+FFFD).
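To make the idea of replacement concrete, here is a minimal sketch that uses only the Rust standard library, not utf8conv. `decode_lossy` is a hypothetical helper name. Note that `String::from_utf8_lossy` allocates, unlike the converters described here, and the number of replacement characters utf8conv emits for a given malformed run may differ from what the standard library does.

```rust
/// Decode bytes leniently with the standard library (not utf8conv).
/// Bytes that cannot be decoded are replaced with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, the byte 0xFF can never appear in well-formed UTF8, so decoding `[0x61, 0xFF, 0x62]` yields `a`, U+FFFD, `b`.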

Includes an adapter iterator that filters out a Byte Order Mark at the beginning of a stream and substitutes carriage returns with newlines.
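The filtering described above can be sketched with standard iterator adapters. This is an illustration, not the crate's implementation: `filter_bom_and_cr_sketch` is a hypothetical name, and the sketch assumes every carriage return maps to one newline. How the crate treats a CR followed by LF is not stated here, so that case may behave differently.

```rust
/// Drop a leading U+FEFF (Byte Order Mark) and turn each '\r' into '\n'.
/// Assumption: a BOM is only removed when it is the very first char.
fn filter_bom_and_cr_sketch<I>(input: I) -> impl Iterator<Item = char>
where
    I: Iterator<Item = char>,
{
    input
        .enumerate()
        // Skip the BOM only at position 0; later U+FEFF values pass through.
        .filter(|&(index, c)| !(index == 0 && c == '\u{FEFF}'))
        .map(|(_, c)| if c == '\r' { '\n' } else { c })
}
```

Because the adapter is lazy and keeps no buffer, it can be chained after a decoding iterator without allocating.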

utf8conv is dual-licensed under the Apache 2.0 license or the MIT license.

Source Repository: link

Credits and attributions for utf8conv are located in the source file doc/utf8conv-credits.md.

Single-buffer iterator based parsing
use utf8conv::*;

// Single buffer iterator based UTF8 parsing converting to char
fn utf8_to_char_single_buffer_iterator() {
    let mybuffer = "abc".as_bytes();
    let mut utf8_ref_iter = mybuffer.iter();
    let mut parser = FromUtf8::new();
    let mut iterator = parser.utf8_ref_to_char_with_iter(&mut utf8_ref_iter);
    while let Some(char_val) = iterator.next()  {
        println!("{}", char_val);
        println!("{}", iterator.has_invalid_sequence());
    }
}

// Single buffer iterator based char parsing converting to UTF8
fn char_to_utf8_single_buffer_iterator() {
    let mybuffer = [ '\u{7F}', '\u{80}', '\u{81}', '\u{82}' ];
    let mut char_ref_iter = mybuffer.iter();
    let mut parser = FromUnicode::new();
    let mut iterator = parser.char_ref_to_utf8_with_iter(&mut char_ref_iter);
    while let Some(utf8_val) = iterator.next()  {
        println!("{:#02x}", utf8_val);
        println!("{}", iterator.has_invalid_sequence());
    }
}
Multi-buffer iterator based parsing
use utf8conv::*;

// Multi-buffer iterator based UTF8 parsing converting to char
fn utf8_to_char_multi_buffer_iterator() {
    let mybuffers = ["ab".as_bytes(), "".as_bytes(), "cde".as_bytes()];
    let mut parser = FromUtf8::new();
    for indx in 0 .. mybuffers.len() {
        parser.set_is_last_buffer(indx == mybuffers.len() - 1);
        let mut utf8_ref_iter = mybuffers[indx].iter();
        let mut iterator = parser.utf8_ref_to_char_with_iter(&mut utf8_ref_iter);
        while let Some(char_val) = iterator.next()  {
            println!("{}", char_val);
            println!("{}", iterator.has_invalid_sequence());
        }
    }
}

// Multi-buffer iterator based char parsing converting to UTF8
fn char_to_utf8_multi_buffer_iterator() {
    let mybuffers = [[ '\u{7F}', '\u{80}' ] , [ '\u{81}', '\u{82}' ]];
    let mut parser = FromUnicode::new();
    for indx in 0 .. mybuffers.len() {
        parser.set_is_last_buffer(indx == mybuffers.len() - 1);
        let mut char_ref_iter = mybuffers[indx].iter();
        let mut iterator = parser.char_ref_to_utf8_with_iter(&mut char_ref_iter);
        while let Some(utf8_val) = iterator.next()  {
            println!("{:#02x}", utf8_val);
            println!("{}", iterator.has_invalid_sequence());
        }
    }
}
Single-buffer slice based parsing
use utf8conv::*;

// Single-buffer slice reading based UTF8 parsing converting to char
fn utf8_to_char_single_buffer_slice_reading() {
    let mybuffer = "Wxyz".as_bytes();
    let mut parser = FromUtf8::new();
    let mut cur_slice = mybuffer;
    loop {
        match parser.utf8_to_char(cur_slice) {
            Result::Ok((slice_pos, char_val)) => {
                cur_slice = slice_pos;
                println!("{}", char_val);
                println!("{}", parser.has_invalid_sequence());
            }
            Result::Err(MoreEnum::More(_amt)) => {
                // _amt equals 0 at end of data
                break;
            }
        }
    }
}

// Single-buffer slice reading based UTF32 parsing converting to UTF8
fn utf32_to_utf8_single_buffer_slice_reading() {
    let mybuffer = [0x7Fu32, 0x80u32, 0x81u32, 0x82u32];
    let mut parser = FromUnicode::new();
    let mut current_slice = &mybuffer[..];
    loop {
        match parser.utf32_to_utf8(current_slice) {
            Result::Ok((slice_pos, utf8_val)) => {
                current_slice = slice_pos;
                println!("{:02x}", utf8_val);
                println!("{}", parser.has_invalid_sequence());
            }
            Result::Err(MoreEnum::More(_amt)) => {
                // _amt equals 0 at end of data
                break;
            }
        }
    }
}
Multi-buffer slice based parsing
use utf8conv::*;

// Multi-buffer slice reading based UTF8 parsing converting to char
fn utf8_to_char_multi_buffer_slice_reading() {
    let mybuffers = ["Wx".as_bytes(), "".as_bytes(), "yz".as_bytes()];
    let mut parser = FromUtf8::new();
    for indx in 0 .. mybuffers.len() {
        parser.set_is_last_buffer(indx == mybuffers.len() - 1);
        let mut cur_slice = mybuffers[indx];
        loop {
            match parser.utf8_to_char(cur_slice) {
                Result::Ok((slice_pos, char_val)) => {
                    cur_slice = slice_pos;
                    println!("{}", char_val);
                    println!("{}", parser.has_invalid_sequence());
                }
                Result::Err(MoreEnum::More(_amt)) => {
                    // _amt equals 0 at end of data
                    break;
                }
            }
        }
    }
}

// Multi-buffer slice reading based UTF32 parsing converting to UTF8
fn utf32_to_utf8_multi_buffer_slice_reading() {
    let mybuffers = [[0x7Fu32, 0x80u32], [0x81u32, 0x82u32]];
    let mut parser = FromUnicode::new();
    for indx in 0 .. mybuffers.len() {
        parser.set_is_last_buffer(indx == mybuffers.len() - 1);
        let current_array = mybuffers[indx];
        let mut current_slice = &current_array[..];
        loop {
            match parser.utf32_to_utf8(current_slice) {
                Result::Ok((slice_pos, utf8_val)) => {
                    current_slice = slice_pos;
                    println!("{:02x}", utf8_val);
                    println!("{}", parser.has_invalid_sequence());
                }
                Result::Err(MoreEnum::More(_amt)) => {
                    // _amt equals 0 at end of data
                    break;
                }
            }
        }
    }
}
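
The slice-based loops above follow a "decode one value, get back the rest of the slice" pattern. The sketch below imitates that shape using only the standard library, to show why an error value carrying a count is convenient. `More` and `next_char` are hypothetical stand-ins, not the crate's `MoreEnum` or `utf8_to_char`, and the crate's exact behavior for truncated or malformed input may differ.

```rust
/// Hypothetical stand-in for a "need more data / end of data" signal.
#[derive(Debug, PartialEq)]
enum More {
    More(u32),
}

/// Decode one char from the front of `input` and return the remaining slice.
/// Err(More(0)) means the input is exhausted; Err(More(n)) with n > 0 means
/// n more bytes are needed to complete the sequence at the front.
/// Malformed input yields U+FFFD and skips the offending bytes.
fn next_char(input: &[u8]) -> Result<(&[u8], char), More> {
    if input.is_empty() {
        return Err(More::More(0));
    }
    // A UTF8 sequence is at most 4 bytes, so a 4-byte prefix is enough.
    let prefix = &input[..input.len().min(4)];
    let valid = match std::str::from_utf8(prefix) {
        Ok(s) => s,
        // The first char is fine; a later one in the prefix is not.
        Err(e) if e.valid_up_to() > 0 => {
            std::str::from_utf8(&prefix[..e.valid_up_to()]).unwrap()
        }
        Err(e) => match e.error_len() {
            // Definitely malformed: skip the bad bytes, emit a replacement.
            Some(bad) => return Ok((&input[bad..], char::REPLACEMENT_CHARACTER)),
            // Input ended in the middle of a sequence: report the shortfall.
            None => {
                let want = match input[0] {
                    0xC2..=0xDF => 2,
                    0xE0..=0xEF => 3,
                    _ => 4,
                };
                return Err(More::More((want - input.len()) as u32));
            }
        },
    };
    let c = valid.chars().next().unwrap();
    Ok((&input[c.len_utf8()..], c))
}
```

A caller loops on `next_char`, replacing its slice with the returned remainder on `Ok`, and stops or fetches another buffer on `Err`.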

Structs

Adapter iterator converting from a char reference iterator to a UTF8 iterator. (This iterator holds a mutable borrow of the launching FromUnicode object for as long as it is alive.)

A double-ended buffer of byte values with a storage size of 8. Intended for single-threaded use.

Provides conversion functions from char or UTF32 to UTF8

Provides conversion functions from UTF8 to char or UTF32

Adapter iterator converting from a UTF8 iterator to a char iterator. (This iterator holds a mutable borrow of the launching FromUtf8 object for as long as it is alive.)

Adapter iterator converting from a UTF8 reference iterator to a char iterator. (This iterator holds a mutable borrow of the launching FromUtf8 object for as long as it is alive.)

Adapter iterator converting from a UTF32 iterator to a UTF8 iterator. (This iterator holds a mutable borrow of the launching FromUnicode object for as long as it is alive.)
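
As a rough picture of what such an adapter does, the following standard-library sketch wraps a UTF32 iterator and yields UTF8 bytes, keeping the bytes of the current character in a fixed 4-byte array so no heap allocation is needed. `Utf32ToUtf8` is a hypothetical type for illustration. Unlike the crate's adapters, it owns its state rather than borrowing a parser object.

```rust
/// Hypothetical adapter: turns an iterator of UTF32 values into UTF8 bytes.
struct Utf32ToUtf8<I: Iterator<Item = u32>> {
    source: I,
    buf: [u8; 4], // encoded bytes of the char currently being emitted
    len: usize,   // number of valid bytes in `buf`
    pos: usize,   // next byte of `buf` to hand out
}

impl<I: Iterator<Item = u32>> Utf32ToUtf8<I> {
    fn new(source: I) -> Self {
        Utf32ToUtf8 { source, buf: [0; 4], len: 0, pos: 0 }
    }
}

impl<I: Iterator<Item = u32>> Iterator for Utf32ToUtf8<I> {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        if self.pos == self.len {
            // Current char fully emitted: pull and encode the next value.
            let value = self.source.next()?;
            // Surrogates and values above 0x10FFFF are not valid scalar
            // values, so they are replaced with U+FFFD.
            let c = char::from_u32(value).unwrap_or(char::REPLACEMENT_CHARACTER);
            self.len = c.encode_utf8(&mut self.buf).len();
            self.pos = 0;
        }
        let byte = self.buf[self.pos];
        self.pos += 1;
        Some(byte)
    }
}
```

Feeding it `0x7F, 0x80, 0xD800` produces the bytes `7F`, `C2 80`, and `EF BF BD`, the last being the replacement character for the unpaired surrogate.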

Enums

Indicates that more data is needed when the parameter value is greater than 0, or an end-of-data condition when the parameter value is 0.

Utf8EndEnum is the result container for the UTF8 to char finite state machine.

Indicates the type of UTF8 encoding when converting from UTF32 to UTF8.
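
One way to picture a "type of UTF8" classification is by the number of bytes a code point needs. The sketch below uses the standard UTF8 ranges. `Utf8Kind` and `classify` are hypothetical names chosen for illustration, and the crate's own enum may use different variants.

```rust
/// Hypothetical classification of a UTF32 value by its UTF8 encoded length.
#[derive(Debug, PartialEq)]
enum Utf8Kind {
    One,
    Two,
    Three,
    Four,
    Invalid,
}

/// Classify a UTF32 value using the standard UTF8 ranges.
fn classify(value: u32) -> Utf8Kind {
    match value {
        0x0000..=0x007F => Utf8Kind::One,
        0x0080..=0x07FF => Utf8Kind::Two,
        // Surrogates are not scalar values; this arm must come before
        // the three-byte range, which would otherwise match them.
        0xD800..=0xDFFF => Utf8Kind::Invalid,
        0x0800..=0xFFFF => Utf8Kind::Three,
        0x1_0000..=0x10_FFFF => Utf8Kind::Four,
        _ => Utf8Kind::Invalid,
    }
}
```

An encoder can branch on this result once and then emit the right number of bytes, substituting the replacement character for the `Invalid` case.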

Constants

byte 1 of replacement char in UTF8

byte 2 of replacement char in UTF8

byte 3 of replacement char in UTF8

replacement character (UTF32)
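
For reference, the values behind these constants can be derived with the standard library: the replacement character is U+FFFD, and its UTF8 encoding is three bytes long. `replacement_utf8_bytes` is a hypothetical helper name, and the sketch does not use the crate's constants.

```rust
/// Encode U+FFFD with the standard library and return its three UTF8 bytes.
fn replacement_utf8_bytes() -> [u8; 3] {
    let mut buf = [0u8; 4];
    let encoded = char::REPLACEMENT_CHARACTER.encode_utf8(&mut buf);
    let bytes = encoded.as_bytes();
    [bytes[0], bytes[1], bytes[2]]
}
```

The result is `[0xEF, 0xBF, 0xBD]`, which is why three separate byte constants are enough.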

Traits

Common operations for UTF conversion parsers

Functions

Function char_iter_to_utf32_iter() takes a mutable reference to a char iterator and returns a UTF32 iterator in its place.

Function char_ref_iter_to_char_iter() takes a mutable reference to a char reference iterator and returns a char iterator in its place.

Classifies a UTF32 value by the type of UTF8 encoding it belongs to.

Function filter_bom_and_cr_iter() takes a mutable reference to a char iterator and returns a filtered char iterator in its place.

Decodes from UTF8 to a Unicode code point using a finite state machine.

Function utf8_ref_iter_to_utf8_iter() takes a mutable reference to a UTF8 reference iterator and returns a UTF8 iterator in its place.

Function utf32_ref_iter_to_utf32_iter() takes a mutable reference to a UTF32 reference iterator and returns a UTF32 iterator in its place.
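
The finite state machine decoding listed among the functions above can be sketched as a byte-at-a-time decoder whose small state survives between buffers, which is what lets a sequence split across two buffers decode correctly. Everything here (`Utf8Fsm`, `push`, `finish`, `decode_buffers`) is hypothetical and written against the standard library only. It is not the crate's state machine, and its replacement policy for malformed input is an assumption. The `String` in `decode_buffers` is used only to collect output for the demonstration.

```rust
/// Hypothetical byte-at-a-time decoder; its state persists across buffers.
struct Utf8Fsm {
    code_point: u32,
    needed: u8, // continuation bytes still expected
    lower: u8,  // allowed range for the next continuation byte
    upper: u8,
}

impl Utf8Fsm {
    fn new() -> Self {
        Utf8Fsm { code_point: 0, needed: 0, lower: 0x80, upper: 0xBF }
    }

    /// Feed one byte; `emit` is called for each decoded or replaced char.
    fn push(&mut self, byte: u8, emit: &mut impl FnMut(char)) {
        if self.needed > 0 {
            if (self.lower..=self.upper).contains(&byte) {
                self.lower = 0x80;
                self.upper = 0xBF;
                self.code_point = (self.code_point << 6) | u32::from(byte & 0x3F);
                self.needed -= 1;
                if self.needed == 0 {
                    let c = char::from_u32(self.code_point);
                    emit(c.unwrap_or(char::REPLACEMENT_CHARACTER));
                }
                return;
            }
            // Broken sequence: report it, reset, then treat `byte` as a new start.
            *self = Utf8Fsm::new();
            emit(char::REPLACEMENT_CHARACTER);
        }
        match byte {
            0x00..=0x7F => emit(char::from(byte)),
            0xC2..=0xDF => {
                self.needed = 1;
                self.code_point = u32::from(byte & 0x1F);
            }
            0xE0..=0xEF => {
                if byte == 0xE0 { self.lower = 0xA0; } // reject overlong forms
                if byte == 0xED { self.upper = 0x9F; } // reject surrogates
                self.needed = 2;
                self.code_point = u32::from(byte & 0x0F);
            }
            0xF0..=0xF4 => {
                if byte == 0xF0 { self.lower = 0x90; } // reject overlong forms
                if byte == 0xF4 { self.upper = 0x8F; } // stay within 0x10FFFF
                self.needed = 3;
                self.code_point = u32::from(byte & 0x07);
            }
            _ => emit(char::REPLACEMENT_CHARACTER),
        }
    }

    /// Call after the last buffer: a dangling partial sequence is invalid.
    fn finish(&mut self, emit: &mut impl FnMut(char)) {
        if self.needed > 0 {
            *self = Utf8Fsm::new();
            emit(char::REPLACEMENT_CHARACTER);
        }
    }
}

/// Demonstration helper: decode several buffers with one shared state machine.
fn decode_buffers(buffers: &[&[u8]]) -> String {
    let mut fsm = Utf8Fsm::new();
    let mut out = String::new();
    let mut emit = |c: char| out.push(c);
    for buf in buffers {
        for &byte in buf.iter() {
            fsm.push(byte, &mut emit);
        }
    }
    fsm.finish(&mut emit);
    out
}
```

The `finish` step plays a role similar to marking the last buffer: only once no more input can arrive is an incomplete sequence known to be an error rather than a sequence still in progress.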