Crate unicode_segmentation
Iterators which split strings on Grapheme Cluster, Word or Sentence boundaries, according to the Unicode Standard Annex #29 rules.
extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Extended grapheme clusters: each base character stays with its combining marks.
    let s = "a̐éö̲\r\n";
    let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
    let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
    assert_eq!(g, b);

    // Words only: segments without alphabetic or numeric characters are skipped.
    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    let w = s.unicode_words().collect::<Vec<&str>>();
    let b: &[_] = &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"];
    assert_eq!(w, b);

    // Every segment between word boundaries, including spaces and punctuation.
    let s = "The quick (\"brown\") fox";
    let w = s.split_word_bounds().collect::<Vec<&str>>();
    let b: &[_] = &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", "fox"];
    assert_eq!(w, b);
}
no_std
unicode-segmentation does not depend on libstd, so it can be used in crates with the #![no_std] attribute.
crates.io
You can use this package in your project by adding the following to your Cargo.toml:
[dependencies]
unicode-segmentation = "1.9.0"
Structs
GraphemeCursor: Cursor-based segmenter for grapheme clusters.
GraphemeIndices: External iterator for grapheme clusters and byte offsets.
Graphemes: External iterator for a string’s grapheme clusters.
USentenceBoundIndices: External iterator for sentence boundaries and byte offsets.
USentenceBounds: External iterator for a string’s sentence boundaries.
UWordBoundIndices: External iterator for word boundaries and byte offsets.
UWordBounds: External iterator for a string’s word boundaries.
UnicodeSentences: An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
UnicodeWordIndices: An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. This iterator also provides the byte offsets for each substring.
UnicodeWords: An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
Enums
GraphemeIncomplete: An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.
Constants
UNICODE_VERSION: The version of Unicode that this version of unicode-segmentation is based on.
Traits
UnicodeSegmentation: Methods for segmenting strings according to Unicode Standard Annex #29.