Crate icu_segmenter
🚧 [Experimental] Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
This module contains segmenter implementations for the following rules:
- Line breaker that is compatible with Unicode Standard Annex #14 and CSS properties.
- Grapheme cluster breaker, word breaker, and sentence breaker that are compatible with Unicode Standard Annex #29.
🚧 This code is experimental; it may change at any time, in breaking or non-breaking ways,
including in SemVer minor releases. It can be enabled with the "experimental" feature
of the icu meta-crate. Use with caution.
#2259
Examples
Line Break
Segment a string with default options:
use icu::segmenter::LineBreakSegmenter;

let segmenter =
    LineBreakSegmenter::try_new_unstable(&icu_testdata::unstable())
        .expect("Data exists");

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
See LineBreakSegmenter for more examples.
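The breakpoints in the line break example are offsets into the input string, and the list does not include the starting offset 0. The following is a minimal sketch, using only the standard library, of how such a list could be turned into the pieces between break opportunities. `split_at_breaks` is a hypothetical helper written for illustration; it is not part of icu_segmenter, and it assumes the offsets are byte offsets that fall on `char` boundaries.

```rust
/// Hypothetical helper (not part of icu_segmenter): splits `text` into the
/// pieces that lie between consecutive break opportunities.
///
/// Assumes `breakpoints` is sorted and holds byte offsets on `char`
/// boundaries; slicing panics otherwise.
fn split_at_breaks<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Skip a leading 0 or a repeated offset so no empty piece is emitted.
        if end > start {
            pieces.push(&text[start..end]);
            start = end;
        }
    }
    pieces
}
```

With the breakpoints `[6, 11]` from the example, `split_at_breaks("Hello World", &[6, 11])` gives `["Hello ", "World"]`: the space stays attached to the first piece because the break opportunity comes after it.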
Grapheme Cluster Break
See GraphemeClusterBreakSegmenter for examples.
Word Break
Segment a string:
use icu::segmenter::WordBreakSegmenter;

let segmenter =
    WordBreakSegmenter::try_new_unstable(&icu_testdata::unstable())
        .expect("Data exists");

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
See WordBreakSegmenter for more examples.
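Unlike the line break example, the word break result `[0, 5, 6, 11]` includes the starting offset, so each pair of adjacent breakpoints delimits one segment. A rough sketch of how those pairs could be mapped back to substrings, using only the standard library; `segments` and `words` are hypothetical helpers, not icu_segmenter APIs, and treating whitespace-only segments as "not words" is a simplification chosen for this illustration.

```rust
/// Hypothetical helper (not part of icu_segmenter): pairs up adjacent
/// breakpoints and returns the substring between each pair.
///
/// Assumes sorted byte offsets on `char` boundaries.
fn segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

/// Hypothetical helper: keeps only segments that contain something other
/// than whitespace. This is a simplification for illustration only.
fn words<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    segments(text, breakpoints)
        .into_iter()
        .filter(|segment| !segment.trim().is_empty())
        .collect()
}
```

For `"Hello World"` and `[0, 5, 6, 11]`, `segments` yields `["Hello", " ", "World"]` and `words` yields `["Hello", "World"]`.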
Sentence Break
See SentenceBreakSegmenter for examples.
Modules
Data provider struct definitions for this ICU4X component.
Structs
Segments a string into grapheme clusters.
Implements the Iterator trait over the line break opportunities of the given string. Please see the examples in LineBreakSegmenter for its usage.
Options to tailor line breaking behavior, such as for CSS.
Supports loading line break data, and creating line break iterators for different string encodings.
Implements the Iterator trait over the segmenter break opportunities of the given string.
Supports loading sentence break data, and creating sentence break iterators for different string encodings.
Supports loading word break data, and creating word break iterators for different string encodings.
Enums
A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line breaker.
A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line breaker.
Type Definitions
Grapheme cluster break iterator for a Latin-1 (8-bit) string.
Grapheme cluster break iterator for a potentially invalid UTF-8 string.
Grapheme cluster break iterator for an str (a UTF-8 string).
Grapheme cluster break iterator for a UTF-16 string.
Line break iterator for a Latin-1 (8-bit) string.
Line break iterator for a potentially invalid UTF-8 string.
Line break iterator for an str (a UTF-8 string).
Line break iterator for a UTF-16 string.
Sentence break iterator for a Latin-1 (8-bit) string.
Sentence break iterator for potentially invalid UTF-8 strings.
Sentence break iterator for an str (a UTF-8 string).
Sentence break iterator for a UTF-16 string.
Word break iterator for a Latin-1 (8-bit) string.
Word break iterator for a potentially invalid UTF-8 string.
Word break iterator for an str (a UTF-8 string).
Word break iterator for a UTF-16 string.
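Each iterator above works on a different string encoding, and the same text has a different length in each one. Assuming (this page does not state it) that every iterator reports break offsets in the code units of its own input, offsets from one encoding would not be interchangeable with another. The sketch below uses only the standard library to show how the lengths diverge; `code_unit_lengths` is a hypothetical helper, not an icu_segmenter API.

```rust
/// Hypothetical helper (not part of icu_segmenter): returns the length of
/// `text` in UTF-8 bytes, in UTF-16 code units, and in Latin-1 bytes.
/// The Latin-1 length is `None` when some character is above U+00FF and
/// therefore cannot be represented in Latin-1.
fn code_unit_lengths(text: &str) -> (usize, usize, Option<usize>) {
    let utf8_len = text.len();
    let utf16_len = text.encode_utf16().count();
    let latin1_len = if text.chars().all(|c| (c as u32) <= 0xFF) {
        // Every Latin-1 character occupies exactly one byte.
        Some(text.chars().count())
    } else {
        None
    };
    (utf8_len, utf16_len, latin1_len)
}
```

For example, `code_unit_lengths("café")` returns `(5, 4, Some(4))`, because "é" takes two bytes in UTF-8 but a single code unit in UTF-16 and in Latin-1.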