Crate unicode_reader

This crate provides adaptors which wrap byte-oriented readers and yield the UTF-8 data as Unicode code points or grapheme clusters.

Unlike Unicode parsers that work on in-memory strings (e.g. unicode_segmentation, upon which this crate is built), this crate works on streams and does not require reading all of the data into memory. Instead, it yields graphemes or code points as it reads them.

Example

extern crate unicode_reader;
use unicode_reader::{CodePoints, Graphemes};

use std::io::Cursor;

fn main() {
    let input = Cursor::new("He\u{302}\u{320}llo");
    let mut graphemes = Graphemes::from(input);
    assert_eq!("H",                 graphemes.next().unwrap().unwrap());
    assert_eq!("e\u{302}\u{320}",   graphemes.next().unwrap().unwrap()); // one grapheme made of 3 code points
    assert_eq!("l",                 graphemes.next().unwrap().unwrap());
    assert_eq!("l",                 graphemes.next().unwrap().unwrap());
    assert_eq!("o",                 graphemes.next().unwrap().unwrap());
    assert!(graphemes.next().is_none());

    let greek_bytes = vec![0xCE, 0xA7, 0xCE, 0xB1, 0xCE, 0xAF, 0xCF, 0x81, 0xCE, 0xB5,
                           0xCF, 0x84, 0xCE, 0xB5];
    // `map` consumes the iterator, so the binding does not need to be `mut`.
    let codepoints = CodePoints::from(Cursor::new(greek_bytes));
    assert_eq!(vec!['Χ', 'α', 'ί', 'ρ', 'ε', 'τ', 'ε'],
                codepoints.map(|r| r.unwrap())
                          .collect::<Vec<char>>());
}

Structs

BadUtf8Error

An error raised when parsing a UTF-8 byte stream fails.

CodePoints

Wraps a byte-oriented reader and yields the UTF-8 data one code point at a time. Any UTF-8 parsing errors are raised as io::Error with ErrorKind::InvalidData.

Graphemes

Wraps a char-oriented reader and yields the data one Unicode grapheme cluster at a time.
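
To make the streaming idea behind CodePoints more concrete, here is a minimal standard-library-only sketch of an iterator that decodes UTF-8 one code point at a time from any `Read` and reports malformed input as `io::Error` with `ErrorKind::InvalidData`. It is not this crate's implementation: the names `StreamingChars`, `invalid`, and `decode_all` are hypothetical, and the sketch makes no attempt to resynchronize after an error.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a streaming code point adaptor.
/// It holds only the underlying byte iterator, never the whole input.
struct StreamingChars<R: Read> {
    bytes: io::Bytes<R>,
}

impl<R: Read> StreamingChars<R> {
    fn new(reader: R) -> Self {
        StreamingChars { bytes: reader.bytes() }
    }
}

/// Builds an `InvalidData` error, the kind used for UTF-8 parsing failures.
fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

impl<R: Read> Iterator for StreamingChars<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        // End of input before a leading byte is a clean end of stream.
        let first = match self.bytes.next()? {
            Ok(b) => b,
            Err(e) => return Some(Err(e)),
        };
        // The leading byte determines how many bytes the sequence has.
        let len = match first {
            0x00..=0x7F => return Some(Ok(first as char)),
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Some(Err(invalid("invalid UTF-8 leading byte"))),
        };
        // Read only the remaining bytes of this one sequence.
        let mut buf = [first, 0, 0, 0];
        for slot in buf[1..len].iter_mut() {
            match self.bytes.next() {
                Some(Ok(b)) => *slot = b,
                Some(Err(e)) => return Some(Err(e)),
                None => return Some(Err(invalid("truncated UTF-8 sequence"))),
            }
        }
        // Let the standard library validate continuation bytes,
        // overlong encodings, and surrogates.
        match std::str::from_utf8(&buf[..len]) {
            Ok(s) => s.chars().next().map(Ok),
            Err(_) => Some(Err(invalid("invalid UTF-8 sequence"))),
        }
    }
}

/// Convenience helper: collects every code point, stopping at the first error.
fn decode_all<R: Read>(reader: R) -> io::Result<Vec<char>> {
    StreamingChars::new(reader).collect()
}
```

Calling `decode_all(Cursor::new(greek_bytes))` with the bytes from the example above should produce the same seven Greek characters, while a stray byte such as `0xFF` produces an `InvalidData` error. Because `Read::bytes` performs one read per byte, wrapping the reader in a `BufReader` is advisable for files or sockets.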