Crate unic_segment
UNIC — Unicode Text Segmentation Algorithms
A component of unic: Unicode and Internationalization Crates for Rust.
This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (sentence segmentation is not implemented yet).
Examples
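use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

// Grapheme clusters: each base character stays with its combining marks.
assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

// CR LF is a single cluster, and so is each pair of regional indicators.
assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

// Same segmentation, paired with the starting byte offset of each cluster.
assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
);

// Predicate for `Words`: keep only segments with an alphanumeric character.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

// Every segment between word boundaries, including whitespace and punctuation.
// Note the two spaces before "fox", which yield two separate " " segments.
assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

// Same segmentation, paired with the starting byte offset of each segment.
assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"), (3, ","), (4, " "), (5, "it's"), (9, " "),
        (10, "29.3"), (14, "°"), (16, "F"), (17, "!")
    ]
);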
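To give a feel for what the grapheme examples above are doing, here is a minimal standard-library-only sketch. It is not this crate's implementation and not the UAX #29 rule set: it assumes only two rules (characters in U+0300..=U+036F attach to the preceding character, and CR LF stays together), and the names `is_toy_extend` and `toy_graphemes` are hypothetical.

```rust
/// Toy stand-in for the "Extend" property: only the Combining Diacritical
/// Marks block. The real property covers far more characters.
fn is_toy_extend(ch: char) -> bool {
    ('\u{300}'..='\u{36F}').contains(&ch)
}

/// Splits `s` into approximate grapheme clusters using the two toy rules.
fn toy_graphemes(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0; // byte offset where the current cluster begins
    let mut prev: Option<char> = None;

    for (idx, ch) in s.char_indices() {
        if let Some(p) = prev {
            // No boundary before a combining mark, or between CR and LF.
            let glued = is_toy_extend(ch) || (p == '\r' && ch == '\n');
            if !glued {
                out.push(&s[start..idx]);
                start = idx;
            }
        }
        prev = Some(ch);
    }
    // Flush the final cluster, if any.
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}
```

On the first example string this toy version happens to agree with `Graphemes`, but it would mis-segment the regional-indicator flags in the second example, which is one reason to rely on the full algorithm.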
Structs
GraphemeCursor | Cursor-based segmenter for grapheme clusters. |
GraphemeIndices | External iterator for grapheme clusters and byte offsets. |
Graphemes | External iterator for a string's grapheme clusters. |
WordBoundIndices | External iterator for word boundaries and byte offsets. |
WordBounds | External iterator for a string's word boundaries. |
Words | An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with |
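The relationship between `WordBounds` and `Words` can be illustrated without the crate: the `Words` example passes a predicate, and the result looks like the `WordBounds` segments with the non-matching ones dropped. The sketch below assumes that reading. It applies the same `has_alphanumeric` predicate to an already-segmented slice, and `keep_words` is a hypothetical helper, not part of this crate.

```rust
/// Same predicate as in the crate-level example: true if the segment
/// contains at least one alphanumeric character.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

/// Hypothetical helper: filters pre-computed word-boundary segments down
/// to the ones the predicate accepts.
fn keep_words<'a>(segments: &[&'a str]) -> Vec<&'a str> {
    segments.iter().copied().filter(has_alphanumeric).collect()
}
```

Applied to the `WordBounds` output shown in the examples, this keeps `"The"`, `"quick"`, `"brown"`, and `"fox"` and discards the whitespace and punctuation segments.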
Enums
GraphemeIncomplete | An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided. |
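The idea behind a "not enough content" error can be sketched with a toy chunked query. This is an illustration only: the `Incomplete` enum and both functions are hypothetical, they model a single rule (no break between CR and LF), and they do not reflect the actual variants or API of `GraphemeIncomplete` or `GraphemeCursor`.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum Incomplete {
    /// The answer depends on text that follows the provided chunk.
    NextChunk,
}

/// Toy query: is the end of `chunk` a boundary? `is_last` says whether
/// the chunk is the end of the whole text.
fn is_boundary_at_end(chunk: &str, is_last: bool) -> Result<bool, Incomplete> {
    if is_last {
        return Ok(true); // end of text is always a boundary
    }
    if chunk.ends_with('\r') {
        // A following LF would glue to this CR; we cannot decide yet.
        Err(Incomplete::NextChunk)
    } else {
        Ok(true)
    }
}

/// Resolves the query once the caller has supplied the next chunk.
fn is_boundary_between(chunk: &str, next: &str) -> bool {
    !(chunk.ends_with('\r') && next.starts_with('\n'))
}
```

A caller would typically try the first function, and on `Err(Incomplete::NextChunk)` fetch more text and retry with the second.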
Constants
PKG_DESCRIPTION | UNIC component description. |
PKG_NAME | UNIC component name. |
PKG_VERSION | UNIC component version. |
UNICODE_VERSION | The Unicode version of data |