Crate icu_segmenter

Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

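This module contains segmenter implementations for the following rules: line breaking (Unicode Standard Annex #14), and grapheme cluster, word, and sentence segmentation (Unicode Standard Annex #29).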

§Examples

§Line Break

Find line break opportunities:

 use icu::segmenter::LineSegmenter;

 let segmenter = LineSegmenter::new_auto();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);

See LineSegmenter for more examples.
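
The breakpoints are byte offsets into the input string, starting at 0 and ending at the string's length. A minimal sketch of turning such a list into text slices, using only the standard library; the helper name `split_at_breakpoints` is hypothetical and not part of icu_segmenter:

```rust
/// Splits `text` at the given byte offsets by pairing each breakpoint
/// with the one that follows it.
/// Hypothetical helper, not part of icu_segmenter.
/// Panics if an offset is out of range or not on a UTF-8 character boundary.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Called with the example text and the offsets [0, 6, 13, 17, 23, 29, 36], this yields "Hello ", "World. ", "Xin ", "chào ", "thế ", and "giới!": each segment keeps its trailing space, since the break opportunity comes after it.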

§Grapheme Cluster Break

Find all grapheme cluster boundaries:

 use icu::segmenter::GraphemeClusterSegmenter;

 let segmenter = GraphemeClusterSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
         19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
     ]
 );

See GraphemeClusterSegmenter for more examples.
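
A grapheme cluster can span several bytes, which is why the offsets above jump from 19 to 21 at "à" and from 25 to 28 at "ế". The sketch below uses only the standard library to count the clusters described by a breakpoint list and to pick out the multi-byte ones. Both helper names are hypothetical, and the breakpoints are expected to be passed in (for example, copied from the assertion above) rather than computed here:

```rust
/// Number of segments delimited by a breakpoint list that starts at 0
/// and ends at the text length. Hypothetical helper.
fn segment_count(breakpoints: &[usize]) -> usize {
    breakpoints.len().saturating_sub(1)
}

/// Returns the segments that occupy more than one byte.
/// Hypothetical helper; panics if an offset is not on a character boundary.
fn multibyte_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .filter(|pair| pair[1] - pair[0] > 1)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

For the example text, the 32 breakpoints describe 31 clusters, while `text.len()` reports 36 bytes; the three multi-byte clusters are "à", "ế", and "ớ".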

§Word Break

Find all word boundaries:

 use icu::segmenter::WordSegmenter;

 let segmenter = WordSegmenter::new_auto();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
 );

See WordSegmenter for more examples.
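
The word boundaries also delimit the spaces and punctuation between words, so the 14 breakpoints above describe 13 segments rather than 6 words. The crate exposes a WordType tag through WordBreakIterator::word_type() for telling these apart. As a rough stand-in that needs only the standard library, the sketch below keeps segments containing at least one alphanumeric character; the helper name `wordlike_segments` is hypothetical, and the filter is an approximation, not the crate's classification:

```rust
/// Slices `text` at the given byte offsets and keeps only the segments
/// that contain at least one alphanumeric character.
/// Hypothetical helper; a std-only approximation, not icu_segmenter's WordType.
fn wordlike_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}
```

With the example text and the breakpoints from the assertion above, this returns "Hello", "World", "Xin", "chào", "thế", and "giới".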

§Sentence Break

Segment the string into sentences:

 use icu::segmenter::SentenceSegmenter;

 let segmenter = SentenceSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 13, 36]);

See SentenceSegmenter for more examples.

Re-exports§

pub use SegmenterError as Error;

Modules§

provider
🚧 [Unstable] Data provider struct definitions for this ICU4X component.

Structs§

GraphemeClusterBreakIterator
Implements the Iterator trait over the grapheme cluster boundaries of the given string.
GraphemeClusterSegmenter
Segments a string into grapheme clusters.
LineBreakIterator
Implements the Iterator trait over the line break opportunities of the given string.
LineBreakOptions
Options to tailor line-breaking behavior.
LineSegmenter
Supports loading line break data and creating line break iterators for different string encodings.
SentenceBreakIterator
Implements the Iterator trait over the sentence boundaries of the given string.
SentenceSegmenter
Supports loading sentence break data and creating sentence break iterators for different string encodings.
WordBreakIterator
Implements the Iterator trait over the word boundaries of the given string.
WordSegmenter
Supports loading word break data and creating word break iterators for different string encodings.

Enums§

LineBreakStrictness
An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line segmenter.
LineBreakWordOption
An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line segmenter.
SegmenterError
A list of error outcomes for various operations in this module.
WordType
The word type tag that is returned by WordBreakIterator::word_type().

Type Aliases§

GraphemeClusterBreakIteratorLatin1
Grapheme cluster break iterator for a Latin-1 (8-bit) string.
GraphemeClusterBreakIteratorPotentiallyIllFormedUtf8
Grapheme cluster break iterator for a potentially invalid UTF-8 string.
GraphemeClusterBreakIteratorUtf8
Grapheme cluster break iterator for a str (a UTF-8 string).
GraphemeClusterBreakIteratorUtf16
Grapheme cluster break iterator for a UTF-16 string.
LineBreakIteratorLatin1
Line break iterator for a Latin-1 (8-bit) string.
LineBreakIteratorPotentiallyIllFormedUtf8
Line break iterator for a potentially invalid UTF-8 string.
LineBreakIteratorUtf8
Line break iterator for a str (a UTF-8 string).
LineBreakIteratorUtf16
Line break iterator for a UTF-16 string.
SentenceBreakIteratorLatin1
Sentence break iterator for a Latin-1 (8-bit) string.
SentenceBreakIteratorPotentiallyIllFormedUtf8
Sentence break iterator for a potentially invalid UTF-8 string.
SentenceBreakIteratorUtf8
Sentence break iterator for a str (a UTF-8 string).
SentenceBreakIteratorUtf16
Sentence break iterator for a UTF-16 string.
WordBreakIteratorLatin1
Word break iterator for a Latin-1 (8-bit) string.
WordBreakIteratorPotentiallyIllFormedUtf8
Word break iterator for a potentially invalid UTF-8 string.
WordBreakIteratorUtf8
Word break iterator for a str (a UTF-8 string).
WordBreakIteratorUtf16
Word break iterator for a UTF-16 string.